PREFACE


Since its founding in 1952, the Advisory Group for Aerospace Research and Development has published, through the
Flight Mechanics Panel, a number of standard texts in the field of flight testing. The original Flight Test Manual was
published in the years 1954 to 1956. The Manual was divided into four volumes: I. Performance, II. Stability and Control,
III. Instrumentation Catalog, and IV. Instrumentation Systems.

As a result of developments in the field of flight test instrumentation, the Flight Test Instrumentation Group of the
Flight Mechanics Panel was established in 1968 to update Volumes III and IV of the Flight Test Manual by the publication of
the Flight Test Instrumentation Series, AGARDograph 160. In its published volumes AGARDograph 160 has covered
recent developments in flight test instrumentation.

In 1978, the Flight Mechanics Panel decided that further specialist monographs should be published covering aspects
of Volumes I and II of the original Flight Test Manual, including the flight testing of aircraft systems. In March 1981, the
Flight Test Techniques Group was established to carry out this task. The monographs of this Series (with the exception of
AG 237, which was separately numbered) are being published as individually numbered volumes of AGARDograph 300. At
the end of each volume of AGARDograph 300 two general Annexes are printed. Annex 1 provides a list of the volumes
published in the Flight Test Instrumentation Series and in the Flight Test Techniques Series. Annex 2 contains a list of
handbooks that are available on a variety of flight test subjects, not necessarily related to the contents of the volume
concerned.

Special  thanks  and  appreciation  are  extended  to  Mr  F.N.Stoliker  (US),  who  chaired  the  Group  for  two  years  from  its 
inception  in  1981,  established  the  ground  rules  for  the  operation  of  the  Group  and  marked  the  outlines  for  future 
publications. 

In the preparation of the present volume the members of the Flight Test Techniques Group listed below have taken an
active part. AGARD has been most fortunate in finding these competent people willing to contribute their knowledge and
time in the preparation of this volume.

NOMENCLATURE


SYMBOLS 

It is impractical to list all of the symbols used in this document. The following are symbols of particular
significance and those used consistently in large portions of the document. In several specialized
situations, the same symbols are used with different meanings not included in this list.

A            stability matrix
B            control matrix
b(.)         bias
C            state observation matrix
D            control observation matrix
E{.}         expected value
e            error vector
F(.)         system function
FF*          process noise covariance matrix
F_x(.)       probability distribution function of x
f(.)         system state function
GG*          measurement noise covariance matrix
g(.)         system observation function
h(.)         equation error function
J(.)         cost function

M            Fisher information matrix
$m_\xi$      prior mean of $\xi$
n_i, n(t)    process noise vector
P            prior covariance of $\xi$, or covariance of filtered x
p(x)         probability density function of x, short notation
p_x(.)       probability density function of x, full notation
Q            covariance of predicted x
R            covariance of innovation
t            time
U            system input
u_i, u(t)    dynamic system input vector
V_i          concatenated innovation vector
v            innovation vector
x            parameter vector in static models
x_i, x(t)    dynamic system state vector
Z            system response
Z_i          concatenated response vector
z_i, z(t)    dynamic system response vector
$\Delta$     sample interval
$\eta_i$     measurement noise vector

$\Phi$       state transition matrix
$\Psi$       input transition matrix
$\xi$        vector of unknown parameters
$\Xi$        set of possible parameter values
$\omega$     random noise vector
$\Omega$     probability space
             predicted estimate (in filtering contexts)
^            optimum (in optimization contexts), or estimate (in estimation contexts), or filtered estimate (in
             filtering contexts)
             smoothed estimate
Subscript $\xi$ indicates dependence on $\xi$

Abbreviations and acronyms

arg max_x    value of x that maximizes the following function
corr         correlation
cov          covariance
exp          exponential
ln           natural logarithm
MAP          maximum a posteriori probability
MLE          maximum-likelihood estimator
mse          mean-square error
var          variance

Mathematical notation

f(.)         the entire function f, as opposed to the value of the function at a particular point
*            transpose
$\nabla_x$   gradient with respect to the vector x (result is a row vector when the operand is a scalar, or a
             matrix when the operand is a column vector)
$\nabla_x^2$ second gradient with respect to x
$\Sigma$     series summation
$\Pi$        series product
$\pi$        3.14159...
$\cup$       set union
$\cap$       set intersection
$\subset$    subset
$\in$        element of a set
{x:c}        the set of all x such that condition c holds
(.,.)        inner product
|            conditioned on (in probability contexts)
|.|          absolute value or determinant
d|.|         volume element
$t_i^+$      right-hand limit at $t_i$
n-vector     vector with n elements


IDENTIFICATION  OF  DYNAMIC  SYSTEMS 


Richard E. Maine
Aerospace Engineer

and

Kenneth W. Iliff
Senior Staff Scientist

NASA Ames Research Center
Dryden Flight Research Facility
Edwards, California 93523


SUMMARY OF CONTENTS

The subject of system identification is too broad to be covered completely in one book. This document
is restricted to statistical system identification; that is, methods derived from probabilistic mathematical
statements of the problem. We will be primarily interested in maximum-likelihood and related estimators.
Statistical methods are becoming increasingly important with the proliferation of high-speed, general-purpose
digital computers. Problems that were once solved by hand-plotting the data and drawing a line through them
are now done by telling a computer to fit the best line through the data (or by some completely different,
formerly impractical method). Statistical approaches to system identification are well-suited to computer
application.

Automated statistical algorithms can solve more complicated problems more rapidly, and sometimes more
accurately, than the older manual methods. There is a danger, however, of the engineer's losing the intuitive
feel for the system that arises from long hours of working closely with the data. To use statistical estimation
algorithms effectively, the engineer must have not only a good grasp of the system under analysis, but
also a thorough understanding of the analytic tools used. The analyst must strive to understand how the
system behaves and what characteristics of the data influence the statistical estimators in order to evaluate
the validity and meaning of the results.

Our primary aim in this document is to provide the practicing data analyst with the background necessary
to make effective use of statistical system identification techniques, particularly maximum-likelihood and
related estimators. The intent is to present the theory in a manner that aids intuitive understanding at a
concrete level useful in application. Theoretical rigor has not been sacrificed, but we have tried to avoid
"elegant" proofs that may require three lines to write, but 3 years of study to comprehend the underlying
theory. In particular, such theoretically intriguing subjects as martingales and measure theory are ignored.
Several excellent volumes on these subjects are available, including Balakrishnan (1973), Royden (1968), Rudin
(1974), and Kushner (1971).

We assume that the reader has a thorough background in linear algebra and calculus (Paige, Swift, and
Slobko, 1974; Apostol, 1969; Nering, 1969; and Wilkinson, 1965), including complete familiarity with matrix
operations, vector spaces, inner products, norms, gradients, eigenvalues, and related subjects. The reader
should be familiar with the concept of function spaces as types of abstract vector spaces (Luenberger, 1969),
but does not need expertise in functional analysis. We also assume familiarity with concepts of deterministic
dynamic systems (Zadeh and Desoer, 1963; Wiberg, 1971; and Levan, 1983).

Chapter 1 introduces the basic concepts of system identification. Chapter 2 is an introduction to numerical
optimization methods, which are important to system identification. Chapter 3 reviews basic concepts from
probability theory. The treatment is necessarily abbreviated, and previous familiarity with probability
theory is assumed.

Chapters 4-10 present the body of the theory. Chapter 4 defines the concept of an estimator and some of
the basic properties of estimators. Chapter 5 discusses estimation as a static problem in which time is not
involved. Chapter 6 presents some simple results on stochastic processes. Chapter 7 covers the state estimation
problem for dynamic systems with known coefficients. We first pose it as a static estimation problem,
drawing on the results from Chapter 5. We then show how a recursive formulation results in a simpler solution
process, arriving at the same state estimate. The derivation used for the recursive state estimator (Kalman
filter) does not require a background in stochastic processes; only basic probability and the results from
Chapter 5 are used.

Chapters 8-10 present the parameter estimation problem for dynamic systems. Each chapter covers one of
the basic estimation algorithms. We have considered parameter estimation as a problem in its own right, rather
than forcing it into the form of a nonlinear filtering problem. The general nonlinear filtering problem is
more difficult than parameter estimation for linear systems, and it requires ad hoc approximations for practical
implementation. We feel that our approach is more natural and is easier to understand.

Chapter 11 examines the accuracy of the estimates. The emphasis in this chapter is on evaluating the
accuracy and analyzing causes of poor accuracy. The chapter also includes brief discussions about the roles
of model structure determination and experiment design.

CHAPTER 1


1.0  INTRODUCTION 

System identification is broadly defined as the deduction of system characteristics from measured data.
It is commonly referred to as an inverse problem because it is the opposite of the problem of computing the
response of a system with known characteristics. Gauss (1809, p. 85) refers to "the inverse problem, that is
when the true is to be derived from the apparent place." The inverse problem might be phrased as, "Given the
answer, what was the question?" Phrased in such general terms, system identification is seen as a simple
concept used in everyday life, rather than as an obscure area of mathematics.

Example 1.0-1 The system is your body, and the characteristic of interest is
its mass. You perform an experiment by placing the system on a mechanical
transducer in the bathroom which gives as output a position approximately
proportional to the system mass and the local gravitational field. Based on
previous comparisons with the doctor's scales, you know that your scale
consistently reads 2 lb high, so you subtract this figure from the reading. The
result is still somewhat higher than expected, so you step off of the scales
and then repeat the experiment. The new reading is more "reasonable" and from
it you obtain an estimate of the system mass.

This simple example actually includes several important principles of system identification; for instance,
the resulting estimates are biased (as defined in Chapter 4).

Example  1.0-2  The  "guess  your  weight"  booth  at  the  fair. 

The weight guesser's instrumentation and estimation algorithm are more difficult to describe precisely,
but they are used to solve the same system identification problem.

Example  1.0-3  Newton's  deduction  of  the  theory  of  gravity. 

Newton's problem was much more difficult than the first two examples. He had to deduce not just a single
number, but also the form of the equations describing the system. Newton was a true expert in system
identification (among other things).

As is apparent from the above examples, system identification is as much an art as a science. This point is
often forgotten by scientists who prove elegant mathematical theorems about a model that doesn't adequately
represent the true system to begin with. On the other hand, engineers who reject what they consider to be
"ivory tower theory" are forgoing tools that could give definite answers to some questions, and hints to aid
in the understanding of others.

System identification is closely tied to control theory, partially by some common methodology, and
partially by the use of identified system models for control design. Before you can design a controller for a
system, you must have some notion of the equations describing the system.

Another common purpose of system identification is to help gain an understanding of how a system works.
Newton's investigations were more along this line. (It is unlikely that he wanted to control the motion of
the planets.)

The application of system identification techniques is strongly dependent on the purpose for which the
results are intended; radically different system models and identification techniques may be appropriate for
different purposes related to the same system. The aircraft control system designer will be unimpressed when
given a model based on inputs that cannot be influenced, outputs that cannot be measured, aspects of the
system that the designer does not want to control, and a complicated model in a form not amenable to control
analysis techniques. The same model might be ideal for the aerodynamicist studying the flow around the
vehicle. The first and most important step of any system identification application is to define its purpose.

Following this chapter's overview, this document presents one aspect of the science of system
identification: the theory of statistical estimation. The theory's main purpose is to help the engineer understand the
system, not to serve as a formula for consistently producing the required results. Therefore, our exposition
of the theory, although rigorously defensible, emphasizes intuitive understanding rather than mathematical
sophistication. The following comments of Luenberger (1969, p. 2) also apply to the theory of system
identification:


Some readers may look with great expectation toward functional analysis, hoping
to discover new powerful techniques that will enable them to solve important
problems beyond the reach of simpler mathematical analysis. Such hopes are
rarely realized in practice. The primary utility of functional analysis...is
its role as a unifying discipline, gathering a number of apparently diverse,
specialized mathematical tricks into one or a few geometric principles.

With good intuitive understanding, which arises from such unification, the reader will be better equipped to
extend the ideas to other areas where the solutions, although simple, were not formerly obvious.

The literature of the field often uses the terms "system identification," "parameter identification," and
"parameter estimation" interchangeably. The following sections define and differentiate these broad terms.
The majority of the literature in the field, including most of this document, addresses the field most
precisely called parameter estimation.


1.1 SYSTEM IDENTIFICATION

We begin by phrasing the system identification problem in formal mathematical terms. There are three
elements essential to a system identification problem: a system, an experiment, and a response. We define
these elements here in broad, abstract, set-theoretic terms, before introducing more concrete forms in
Section 1.3.

Let U represent some experiment, taken from the set $\mathcal{U}$ of possible experiments on the system.
U could represent a discrete event, such as stepping on the scales, or a value, such as a voltage applied.
U could also be a vector function of time, such as the motions of the control surfaces while an airplane is
flown through a maneuver. In systems terminology, U is the input to the system. (We will use the terms
"input," "control," and "experiment" more or less interchangeably.)

Observe the response Z of the system to the experiment. As with U, Z could be represented in many
forms including as a discrete event (e.g., "the system blew up") or as a measured time function. It is an
element of the set $\mathcal{Z}$ of possible responses. (We also use the terms "output" or "measurement" for Z.)

The abstract system is a map (function) F from the set of possible experiments to the set of possible
responses.

$F: \mathcal{U} \rightarrow \mathcal{Z}$    (1.1-1)

that is

$Z = F(U)$    (1.1-2)

The system identification problem is to reconstruct the function F from a collection of experiments
$U_i$ and the corresponding system responses $Z_i$. This is the purest form of the "black box" identification
problem. We are asked to identify the system with no information at all about its internal structure, as if
the system were in a black box which we could not see into. Our only information is the inputs and outputs.

An obvious solution is to perform all of the experiments in $\mathcal{U}$ and simply tabulate the responses. This
is usually impossible because the set $\mathcal{U}$ is too large (typically, infinite). Also, we may not have complete
freedom in selecting the $U_i$. Furthermore, even if this approach were possible, the tabular format of the
result would generally be inconvenient and of little help in understanding the structure of the system.

If we cannot perform all of the experiments in $\mathcal{U}$, the system identification problem is impossible
without further information. Since we have made no assumptions about the form of F, we cannot be sure of its
behavior without checking every point.


Example 1.1-1 The input U and output Z of a system are both represented
by real-valued scalar variables. When an input of 1.0 is applied, the output
is 1.0. When an input of -1.0 is applied, the output is also 1.0. Without
further information we cannot tell which of the following representations (or
an infinite number of others) of the system is correct.

a)  $Z = 1$  (independent of U)

b)  $Z = |U|$

c)  $Z = U^2$

d)  The response depends on the time interval between applying U and
measuring Z, which we forgot to consider.
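
The ambiguity in Example 1.1-1 can be made concrete with a minimal sketch. The function names below are illustrative and do not come from the text; each one stands for candidate a), b), or c). All three reproduce the two observed input-output pairs, yet they disagree at an input that was never tried.

```python
# Two observed experiments from Example 1.1-1: (input U, response Z).
observations = [(1.0, 1.0), (-1.0, 1.0)]

# Candidate system functions a), b), c); the names are hypothetical.
def constant_model(u):
    return 1.0

def absolute_value_model(u):
    return abs(u)

def square_model(u):
    return u ** 2

candidates = [constant_model, absolute_value_model, square_model]

def consistent(model, data):
    """True if the model reproduces every observed response exactly."""
    return all(model(u) == z for u, z in data)

all_consistent = all(consistent(m, observations) for m in candidates)

# Responses at an untested input, U = 2, differ among the candidates.
predictions_at_2 = [m(2.0) for m in candidates]
```

The data alone cannot rank the candidates; only an additional experiment or an additional assumption can.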

Example 1.1-2 The input and output of a system are scalar time functions
on the interval $(-\infty, \infty)$. When the input is cos(t), the output is sin(t).
Without more information we cannot distinguish among

a)  $z(t) = \sin(t)$, independent of u

b)  $z(t) = \int_0^t u(s)\,ds$

c)  $z(t) = -\frac{d}{dt} u(t)$

d)  $z(t) = u(t - \pi/2)$
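
As a rough numerical check of Example 1.1-2, the sketch below confirms that two quite different candidate systems, an integrator started at t = 0 and a pure delay of pi/2, both turn cos(t) into sin(t). The function names, sample times, and step count are assumptions chosen for illustration.

```python
import math

def u(t):
    """The applied input, cos(t)."""
    return math.cos(t)

def integrator_response(t, steps=10000):
    """Integral of u from 0 to t, approximated by the trapezoidal rule."""
    h = t / steps
    total = 0.5 * (u(0.0) + u(t))
    for k in range(1, steps):
        total += u(k * h)
    return total * h

def delay_response(t):
    """The input delayed by pi/2."""
    return u(t - math.pi / 2)

sample_times = [0.5, 1.0, 2.0, 3.0]
max_error_integrator = max(
    abs(integrator_response(t) - math.sin(t)) for t in sample_times)
max_error_delay = max(
    abs(delay_response(t) - math.sin(t)) for t in sample_times)
```

Both errors are at the level of numerical round-off or discretization, so this single experiment cannot separate the two systems.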

Example 1.1-3 The input and output of a system are integers in the range
1-100. For every input except U = 37, we measure the output and find it
equal to the input. We have no mathematical basis for drawing any conclusion
about the response to the input U = 37. We could guess that the output might
be Z = 37, but there is no mathematical justification for this guess in the
problem as formulated.

Our inability to draw any conclusions in the above examples (particularly Example (1.1-3), which seems
so obvious intuitively) points out the inadequacy of the pure black-box statement of the system identification
problem. We cannot reconstruct the function F without some guidance on choosing a particular function from
the infinite number of functions consistent with the results of the experiments performed.

We have seen that the pure black box system identification problem, where absolutely no information is
given about the internal structure of the system, is impossible to solve. The information needed to construct
the system function F is thus composed of two parts: information which is assumed, and information which is
deduced from the experimental data. These two information sources can closely interact. For instance, the
experimental data could contradict the assumptions made, requiring a revision of the assumptions, or the data
could be used to select one of a set of candidate assumptions (hypotheses). Such interaction tends to obscure
the role of the assumption, making it seem as though all of the information was obtained from the experimental
data, and thus has a purely objective validity. In fact, this is never the case. Realistically, most of the
information used for constructing the system function F will be assumptions based on knowledge of the nature
of the physical processes of the system. System identification technology based on experimental data is used
only to fill in the relatively small gaps in our knowledge of the system. From this perspective, we recognize
system identification as an extremely useful tool for filling in such knowledge gaps, rather than as a panacea
which will automatically tell us everything we need to know about a system. The capabilities of some modern
techniques may invite the view of system identification as a cure-all, because the underlying assumptions are
subtle and seldom explicitly stated.

Example 1.1-4 Return to the problem of Example (1.1-3). Seemingly, not much
knowledge of the internal behavior of the system is required to deduce that
Z will be 37 when U is 37; indeed, many common system identification
algorithms would make such a deduction. In fact, the assumptions made are
numerous. The specification of the set of possible inputs and outputs already
implies many assumptions about the system; for instance, that there are no
transient effects, or that such effects are unimportant. The problem
statement does not allow for an event such as the system output's oscillating
through several values. We have also made an assumption of repeatability.
Perhaps the same experiment redone tomorrow would produce different results,
depending on some factor not considered. Encompassing all of the other
assumptions is the assumption of simplicity. We have applied Occam's Razor
and found the simplest system consistent with the data. One can easily
imagine useful systems that select specific inputs for special treatment.
Nothing in the data has eliminated such systems. We can see that the
assumptions play the largest role in solving this problem. Granted the assumption
that we want the simplest consistent result, the deduction from the data that
Z = U is trivial.
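
The point about systems that single out specific inputs can be illustrated with a short sketch. Both functions below are hypothetical; the second one, which returns 0 for the input 37, is an invented stand-in for "special treatment." Each matches all 99 measurements, so the data give no reason to prefer one over the other.

```python
# Measured data from Example 1.1-3: output equals input for every integer
# input in 1..100 except 37, which was never tested.
data = {u: u for u in range(1, 101) if u != 37}

def identity_system(u):
    return u

def special_case_system(u):
    # Hypothetical system that singles out one input for special treatment.
    return 0 if u == 37 else u

def fits(system):
    """True if the system reproduces every measured response."""
    return all(system(u) == z for u, z in data.items())

both_fit = fits(identity_system) and fits(special_case_system)
responses_to_37 = (identity_system(37), special_case_system(37))
```

Choosing the identity system is a consequence of the simplicity assumption, not of the measurements.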

Two general types of assumptions exist. The first consists of restrictions on the allowable forms of
the function F. Presumably, such restrictions would reflect the knowledge of what functions are reasonable
considering the physics of the system. The second type of assumption is some criterion for selecting a "best"
function from those consistent with the experimental results. In the following sections, we will see that
these two approaches are combined: restricting the set of functions considered, and then selecting a best
choice from this set.


1.2  PARAMETER  IDENTIFICATION 

For physical systems, information about the general form of the system function F can often be derived
from knowledge of the system. Specific numerical values, however, are sometimes prohibitively difficult to
compute theoretically without making unacceptable approximations. Therefore, the most widely used area of
system identification is the subfield called parameter identification.

In parameter identification, the form of the system function is assumed to be known. This function
contains a finite number of parameters, the values of which must be deduced from experimental data.

Let $\xi$ be a vector with the unknown parameters as its elements. Then the system response Z is a known
function of the input U and the parameter vector $\xi$. We can restate this in a more convenient, but
completely equivalent way. For each value of the parameter vector $\xi$, the system response Z is a known function
of the input U. (The function can be different for different values of $\xi$.) We say that the function is
parameterized by $\xi$ and write

$Z = F_\xi(U)$    (1.2-1)

The function $F_\xi(U)$ is referred to as the assumed system model. The subscript notation for $\xi$ is used purely
for convenience to indicate the special role of $\xi$. The function could be equivalently written as $F(\xi, U)$.

The parameter identification problem is then to deduce the value of $\xi$ based on measurement of the responses
$Z_i$ to a set of inputs $U_i$. This problem of identifying the parameter vector $\xi$ is much less ambitious than
the system identification problem of constructing the entire F function from experimental data; it is more
in line with the amount of information that reasonably can be expected to be obtained from experimental data.

Deducing the value of $\xi$ amounts to solving the following set of simultaneous and generally nonlinear
equations.

$Z_i = F_\xi(U_i)$,    i = 1, 2, ..., N    (1.2-2)

where N is the number of experiments performed. Note that the only variable in these equations is the
parameter vector $\xi$. The $U_i$ and $Z_i$ represent the specific input used and response measured for the ith
experiment. This is quite different from Equation (1.2-1), which expresses a general relationship among the three
variables U, Z, and $\xi$.

Example 1.2-1 In the problem of Example (1.1-1), assume we are given that the
response is a linear function of the input

$Z = F_\xi(U) = a_0 + a_1 U$

The parameter vector is $\xi = (a_0, a_1)^*$, the values of $a_0$ and $a_1$ being unknown.
We were given that U = -1 and U = +1 both result in Z = 1; thus
Equation (1.2-2) expands to

$1 = F_\xi(-1) = a_0 - a_1$
$1 = F_\xi(1) = a_0 + a_1$

This system is easy to solve and gives $a_0 = 1$ and $a_1 = 0$. Thus we have
F(U) = 1 (independent of U).
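
For readers who prefer to see Equation (1.2-2) solved mechanically, here is a minimal sketch for the linear model of Example 1.2-1. The helper `solve_2x2` is a hypothetical name introduced only for this illustration; it applies Cramer's rule with exact rational arithmetic.

```python
from fractions import Fraction

def solve_2x2(m, rhs):
    """Solve a 2x2 linear system by Cramer's rule with exact arithmetic."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("no unique solution")
    x = Fraction(rhs[0] * d - b * rhs[1], det)
    y = Fraction(a * rhs[1] - rhs[0] * c, det)
    return x, y

# Rows are [1, U_i], so row i times (a0, a1) equals a0 + a1 * U_i.
inputs = [-1, 1]
responses = [1, 1]
matrix = [[1, u] for u in inputs]
a0, a1 = solve_2x2(matrix, responses)
```

With two distinct inputs the determinant is nonzero, which is why the two experiments determine both parameters uniquely.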

Example 1.2-2 In the problem of Example (1.1-2), assume we know that the system
can be represented as

$\dot{z}(t) = a z(t) + b u(t)$

or, equivalently, expressing Z as an explicit function of U,

$Z = F_\xi(U)$:  $z(t) = \int_0^t e^{a(t-\tau)} b u(\tau)\,d\tau$

The unknown parameter vector for this system is $\xi = (a, b)^*$. Since
u(t) = cos(t) resulted in z(t) = sin(t), Equation (1.2-2) becomes

$\sin(t) = \int_0^t e^{a(t-\tau)} b \cos(\tau)\,d\tau$

for all $t \in (-\infty, \infty)$. This equation is uniquely solved by a = 0 and b = 1.
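
The solution of Example 1.2-2 can be checked by substitution into the differential form of the model. The short derivation below is a sketch that assumes the zero initial condition z(0) = 0, as implied by an integral starting at 0.

```latex
\begin{aligned}
\dot z(t) = a\,z(t) + b\,u(t)
  \;&\Longrightarrow\; \cos(t) = a\,\sin(t) + b\,\cos(t) \quad \text{for all } t, \\
t = \tfrac{\pi}{2}: \quad 0 = a, \qquad &\; t = 0: \quad 1 = b, \\
z(t) = \int_0^t e^{0\cdot(t-\tau)} \cdot 1 \cdot \cos(\tau)\, d\tau \;&=\; \sin(t).
\end{aligned}
```

Evaluating the identity at two convenient times fixes both parameters, and the last line confirms that these values reproduce the measured response.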

Example 1.2-3 In the problem of Example (1.1-3), assume that the system can
be represented by a polynomial of order 10 or less.

$Z = F_\xi(U) = \sum_{n=0}^{10} a_n U^n$

The unknown parameter vector is $\xi = (a_0, a_1, \ldots, a_{10})^*$. Using the experimental
data described in Example (1.1-3), Equation (1.2-2) becomes

$i = \sum_{n=0}^{10} a_n i^n$,    i = 1, 2, ..., 36, 38, 39, ..., 100

This system of equations is uniquely solved by $a_0 = 0$, $a_1 = 1$, and $a_2$
through $a_{10}$ all equalling 0.
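
A brief sketch can confirm that the stated coefficients satisfy all 99 equations of Example 1.2-3 and show what the fitted model predicts for the untested input. The helper name `polynomial` and the list layout (index n holds the coefficient of U to the power n) are choices made for this illustration.

```python
def polynomial(coefficients, u):
    """Evaluate sum_n a_n * u**n by Horner's rule; coefficients[n] is a_n."""
    value = 0
    for a_n in reversed(coefficients):
        value = value * u + a_n
    return value

# a_0 = 0, a_1 = 1, a_2 ... a_10 = 0, the solution stated in the example.
xi = [0, 1] + [0] * 9

tested_inputs = [u for u in range(1, 101) if u != 37]
all_equations_hold = all(polynomial(xi, u) == u for u in tested_inputs)
predicted_response_to_37 = polynomial(xi, 37)
```

Uniqueness follows from a counting argument: the difference between any fitting polynomial and U has degree at most 10 but vanishes at 99 distinct points, so it must be identically zero.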

As with any set of equations, there are three possible results from Equation (1.2-2). First, there can
be a unique solution, as in each of the examples above. Second, there could be multiple solutions, in which
case either more experiments must be performed or more assumptions would be necessary to restrict the set of
allowable solutions or to pick a best solution in some sense. The third possibility is that there could be no
solutions, the experimental data being inconsistent with the assumed equations. This situation will require a
basic change in our way of thinking about the problem. There will almost never be an exact solution with real
data, so the first two possibilities are somewhat academic. The remainder of the document, and Section 1.4 in
particular, will address the general situation where Equation (1.2-2) need not have an exact solution. The
possibilities of one or more solutions are part of the general case.

Example 1.2-4 In the problem of Example (1.1-1), assume we are given that
the response is a quadratic function of the input

$Z = F_\xi(U) = a_0 + a_1 U + a_2 U^2$

The parameter vector is $\xi = (a_0, a_1, a_2)^*$. We were given that U = -1 and
U = +1 both result in Z = 1. With these data Equation (1.2-2) expands to

$1 = F_\xi(-1) = a_0 - a_1 + a_2$

$1 = F_\xi(1) = a_0 + a_1 + a_2$

From this information we can deduce that $a_1 = 0$, but $a_0$ and $a_2$ are not
uniquely determined. The values might be determined by performing the
experiment U = 0. Alternately, we might decide that the lowest order
system consistent with the data available is preferred, giving $a_2 = 0$
and $a_0 = 1$.

Example 1.2-5 In the problem of Example (1.1-1), assume that we are given
that the response is a linear function of the input. We were given that
U = -1 and U = +1 both result in Z = 1. Suppose that the experiment
U = 0 is performed and results in Z = 0.95. There are then no parameter
values consistent with the data.
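
The multiple-solution and no-solution cases of Examples 1.2-4 and 1.2-5 can be checked numerically. The following minimal Python sketch is an illustration added here, not part of the original development; the helper names `quadratic_model`, `linear_model`, and `is_consistent` are hypothetical.

```python
# Illustrative sketch: test whether candidate parameter vectors reproduce
# the data of Examples 1.2-4 and 1.2-5 exactly. All names are hypothetical.

def quadratic_model(u, xi):
    """Z = a0 + a1*U + a2*U**2 with xi = (a0, a1, a2)."""
    a0, a1, a2 = xi
    return a0 + a1 * u + a2 * u ** 2

def linear_model(u, xi):
    """Z = a0 + a1*U with xi = (a0, a1)."""
    a0, a1 = xi
    return a0 + a1 * u

def is_consistent(model, xi, data, tol=1e-12):
    """True if the model with parameters xi reproduces every (U, Z) pair."""
    return all(abs(model(u, xi) - z) <= tol for u, z in data)

# Example 1.2-4: two data points, three parameters, so many vectors fit.
data_two = [(-1.0, 1.0), (1.0, 1.0)]
candidates = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.5, 0.0, 0.5)]
print([is_consistent(quadratic_model, xi, data_two) for xi in candidates])

# Example 1.2-5: the first two points force a1 = 0 and a0 = 1 for a linear
# model, and that model misses the third point, so no exact solution exists.
data_three = data_two + [(0.0, 0.95)]
print(is_consistent(linear_model, (1.0, 0.0), data_three))
```

Every candidate with $a_1 = 0$ and $a_0 + a_2 = 1$ passes the first check, which is the non-uniqueness noted in Example 1.2-4.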


1.3  TYPES  OF  SYSTEM  MODELS 

Although the basic concept of system modeling is quite general, more useful results can be obtained by
examining specific types of system models. Clarity of exposition is also improved by using specific models,
even when we can obtain the result in a more general context. This section describes some of the broad
classes of system model forms which are often used in parameter identification.

1.3.1  Explicit  Function 

The most basic type of system model is the explicit function. The response Z is written as a known
explicit function of the input U and the parameter vector $\xi$. This type of model corresponds exactly to
Equation (1.2-1):

$$ Z = F_\xi(U) \tag{1.2-1} $$


In the simplest subset of the explicit function models, the response is a linear function of the
parameter vector

$$ Z = f(U)\,\xi \tag{1.3-1} $$

In this equation, f(U) is a matrix which is a known function (nonlinear in general) of the input. This is the
type of model used in linear regression. Many systems can be put into this easily analyzed form, even though
the systems might appear quite complex at first glance.

A common example of a model linear in its parameters is a finite polynomial expansion of Z in terms
of U.

$$ Z = \xi_0 + \xi_1 U + \xi_2 U^2 + \cdots + \xi_n U^n \tag{1.3-2} $$

In this case, f(U) is the row vector $(1, U, U^2, \ldots, U^n)$. Note that Z is linear in the parameters $\xi_i$, but
not in the input U.
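
The distinction between linearity in the parameters and linearity in the input can be illustrated with a short Python sketch of Equation (1.3-2). This is an added illustration under our own naming; `regressor_row` and `model_response` are hypothetical helpers, and the numerical values are arbitrary.

```python
# Illustrative sketch of the polynomial model Z = f(U) xi of Equation (1.3-2).
# Function names and numerical values are hypothetical.

def regressor_row(u, order):
    """Row vector f(U) = (1, U, U**2, ..., U**order)."""
    return [u ** k for k in range(order + 1)]

def model_response(u, xi):
    """Z = f(U) xi for polynomial coefficients xi[0], ..., xi[n]."""
    row = regressor_row(u, len(xi) - 1)
    return sum(f_k * xi_k for f_k, xi_k in zip(row, xi))

xi_1 = [1.0, 2.0, 3.0]
xi_2 = [0.5, -1.0, 4.0]
u = 2.0

# Linear in the parameters: superposition holds in xi at a fixed input.
combined = [2.0 * p + 3.0 * q for p, q in zip(xi_1, xi_2)]
lhs = model_response(u, combined)
rhs = 2.0 * model_response(u, xi_1) + 3.0 * model_response(u, xi_2)
print(abs(lhs - rhs) < 1e-12)

# Not linear in the input: doubling U does not double Z.
print(model_response(2.0 * u, xi_1), 2.0 * model_response(u, xi_1))
```

Because the response is a fixed row vector times $\xi$, stacking one such row per experiment gives the matrix form used in linear regression.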

1.3.2  State  Space 

State-space models are very useful for dynamic systems; that is, systems with responses that are time
functions. Wiberg (1971) and Zadeh and Desoer (1963) give general discussions of state-space models. Time
can be treated as either a continuous or discretized variable in dynamic models; the theories of discrete- and
continuous-time systems are quite different.
The general form for a continuous-time state-space model is

$$ x(t_0) = x_0 \tag{1.3-3a} $$

$$ \dot{x}(t) = f[x(t), u(t), t, \xi] \tag{1.3-3b} $$

$$ z(t) = g[x(t), u(t), t, \xi] \tag{1.3-3c} $$

where f and g are arbitrary known functions. The initial condition $x_0$ can be known or can be a function
of $\xi$. The variable x(t) is defined as the state of the system at time t. Equation (1.3-3b) is called the
state equation, and (1.3-3c) is called the observation equation. The measured system response is z. The
state is not considered to be measured; it is an internal system variable. However, $g[x(t), u(t), t, \xi] = x(t)$
is a legitimate observation function, so the measurement can be equal to the state if so desired.

Discrete-time state space models are similar to continuous-time models, except that the differential
equations are replaced by difference equations. The general form is

$$ x(t_0) = x_0 \tag{1.3-4a} $$

$$ x(t_{i+1}) = f[x(t_i), u(t_i), t_i, \xi] \qquad i = 0, 1, \ldots \tag{1.3-4b} $$

$$ z(t_i) = g[x(t_i), u(t_i), t_i, \xi] \qquad i = 1, 2, \ldots \tag{1.3-4c} $$

The system variables are defined only at the discrete times $t_i$.

This document is largely concerned with continuous-time dynamic systems described by differential
Equations (1.3-3b). The system response, however, is measured at discrete time points, and the computations are
done in a digital computer. Thus, some features of both discrete- and continuous-time systems are pertinent.
The system equations are

$$ x(t_0) = x_0 \tag{1.3-5a} $$

$$ \dot{x}(t) = f[x(t), u(t), t, \xi] \tag{1.3-5b} $$

$$ z(t_i) = g[x(t_i), u(t_i), t_i, \xi] \qquad i = 1, 2, \ldots \tag{1.3-5c} $$

The response $z(t_i)$ is considered to be defined only at the discrete time points $t_i$, although the state x(t)
is defined in continuous time.
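
The mixed form of Equation (1.3-5) can be sketched in a few lines of Python: the state is propagated in (approximately) continuous time by numerical integration, and the observation function is evaluated only at the sample times. This is an added illustration, not a method prescribed by this document; the fourth-order Runge-Kutta integrator, the scalar state, and the example system dx/dt = -ξx + u are all assumptions chosen for brevity.

```python
import math

# Illustrative sketch of Equation (1.3-5): continuous-time state equation,
# discrete-time observations. Scalar state only; all names are hypothetical.

def rk4_step(f, x, u, t, xi, dt):
    """One fourth-order Runge-Kutta step for dx/dt = f(x, u(t), t, xi)."""
    k1 = f(x, u(t), t, xi)
    k2 = f(x + 0.5 * dt * k1, u(t + 0.5 * dt), t + 0.5 * dt, xi)
    k3 = f(x + 0.5 * dt * k2, u(t + 0.5 * dt), t + 0.5 * dt, xi)
    k4 = f(x + dt * k3, u(t + dt), t + dt, xi)
    return x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def simulate(f, g, x0, u, sample_times, xi, t0=0.0, substeps=20):
    """Integrate the state between sample times and return the list z(t_i)."""
    x, t, z = x0, t0, []
    for t_i in sample_times:
        dt = (t_i - t) / substeps
        for _ in range(substeps):
            x = rk4_step(f, x, u, t, xi, dt)
            t += dt
        t = t_i  # discard accumulated round-off in the time variable
        z.append(g(x, u(t_i), t_i, xi))
    return z

def state_rate(x, u, t, xi):
    """Hypothetical state equation dx/dt = -xi*x + u."""
    return -xi * x + u

def observe(x, u, t, xi):
    """Hypothetical observation equation z = x."""
    return x

times = [0.5, 1.0, 1.5, 2.0]
z = simulate(state_rate, observe, 1.0, lambda t: 0.0, times, 0.8)
# With zero input the exact response is exp(-xi*t); print both for comparison.
for t_i, z_i in zip(times, z):
    print(t_i, z_i, math.exp(-0.8 * t_i))
```

The number of integration substeps between samples is a numerical choice independent of the measurement rate, which is one reason the discrete-observation theory resembles the discrete-time theory.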

We will see that the theory of parameter identification for continuous-time systems with discrete
observations is virtually identical to the theory for discrete-time systems in spite of the superficial differences
in the system equation forms. The theory of continuous-time observations requires much deeper mathematical
background and will only be outlined in this document. Since practical application of the algorithms
developed generally requires a digital computer, the continuous-time theory is of secondary importance.

An important subset of systems described by state space equations is the set of linear dynamic systems.
Although the equations are sometimes rewritten in forms convenient for different applications, all linear
dynamic system models can be written in the following forms: the continuous time form is

$$ x(t_0) = x_0 \tag{1.3-6a} $$

$$ \dot{x}(t) = A x(t) + B u(t) \tag{1.3-6b} $$

$$ z(t) = C x(t) + D u(t) \tag{1.3-6c} $$

The matrix A is called the stability matrix, B is called the control matrix, and C and D are called state
and control observation matrices, respectively. The discrete-time form is

$$ x(t_0) = x_0 \tag{1.3-7a} $$

$$ x(t_{i+1}) = \Phi x(t_i) + \Psi u(t_i) \qquad i = 0, 1, \ldots \tag{1.3-7b} $$

$$ z(t_i) = C x(t_i) + D u(t_i) \qquad i = 1, 2, \ldots \tag{1.3-7c} $$

The matrices $\Phi$ and $\Psi$ are called the system transition matrices. The form for continuous systems with
discrete observations is identical to Equation (1.3-6), except that the observation is defined only at the
discrete time points. In all three forms, A, B, C, D, $\Phi$, and $\Psi$ are matrix functions of the parameter
vector $\xi$. These matrices are functions of time in general, but for notational simplicity, we will not
explicitly indicate the time dependence unless it is important to a discussion.

The continuous-time and discrete-time state-equation forms are closely related. In many applications,
the discrete-time form of Equation (1.3-7) is used as a discretized approximation to Equation (1.3-6). In
this case, the transition matrices $\Phi$ and $\Psi$ are related to the A and B matrices by the equations

$$ \Phi = \exp(A\Delta) \tag{1.3-8a} $$
$$ \Psi = \int_0^{\Delta} \exp(A\tau)\, d\tau \; B \tag{1.3-8b} $$

where

$$ \Delta = t_{i+1} - t_i \tag{1.3-8c} $$

We discuss this relationship in more detail in Section 7.5. In a similar manner, Equation (1.3-4) is sometimes
viewed as an approximation to Equation (1.3-3). Although the principle in the nonlinear case is the same as
in the linear case, we cannot write precise expressions for the relationship in such simple closed forms as in
the linear case.
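
For the linear case, the transition matrices can be computed directly from their power series. The sketch below is an added illustration in plain Python. It assumes that the input is held constant over each sample interval, so that $\Psi$ is the integral of exp(Aτ) over one interval multiplied by B (the standard relation for that assumption), and it uses a truncated Taylor series that is adequate only when the product of A and Δ is of modest size. The function name `transition_matrices` and the helper routines are hypothetical.

```python
import math

# Illustrative sketch: Phi = exp(A*delta) and Psi = (integral of exp(A*s) ds
# from 0 to delta) * B, by truncated power series. Assumes the input is held
# constant over each interval. All names are hypothetical.

def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_add(a, b):
    """Elementwise sum of two matrices of equal size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_scale(a, s):
    """Matrix a multiplied by the scalar s."""
    return [[s * x for x in row] for row in a]

def identity(n):
    """n-by-n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def transition_matrices(A, B, delta, terms=30):
    """Return (Phi, Psi) for sample interval delta."""
    n = len(A)
    term = identity(n)                         # A^k delta^k / k!, starting at k = 0
    phi = identity(n)                          # running sum for exp(A*delta)
    integral = mat_scale(identity(n), delta)   # running sum for the integral
    for k in range(1, terms):
        term = mat_scale(mat_mul(term, A), delta / k)
        phi = mat_add(phi, term)
        # Integrating the k-th series term gives A^k delta^(k+1) / (k+1)!
        integral = mat_add(integral, mat_scale(term, delta / (k + 1)))
    return phi, mat_mul(integral, B)

# Double integrator: the series terminates, so the result is exact.
phi, psi = transition_matrices([[0.0, 1.0], [0.0, 0.0]], [[0.0], [1.0]], 0.1)
print(phi, psi)

# Scalar check against the closed forms exp(a*delta) and (exp(a*delta) - 1)/a.
phi_s, psi_s = transition_matrices([[-2.0]], [[1.0]], 0.5)
print(phi_s[0][0], math.exp(-1.0), psi_s[0][0], (1.0 - math.exp(-1.0)) / 2.0)
```

For stiff or large systems a scaled algorithm from a numerical library would be preferable to this plain series.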

Standardized canonical forms of the state-space equations (Wiberg, 1971) play an important role in some
approaches to parameter estimation. We will not emphasize canonical forms in this document. The basic theory
of parameter identification is the same, whether canonical forms are used or not. In some applications,
canonical forms are useful, or even necessary. Such forms, however, destroy any internal relationship between
the model structure and the system, retaining only the external response characteristics. Fidelity to the
internal as well as to the external system characteristics is a significant aid to engineering judgment and to
the incorporation of known facts about the system, both of which play crucial roles in system identification.
For instance, we might know the values of many locations of the A matrix in its "natural" form. When the
A matrix is transformed to a canonical form, these simple facts generally become unwieldy equations which
cannot reasonably be used. When there is little useful knowledge of the internal system structure, the use of
canonical forms becomes more appropriate.

1.3.3  Others 


Other types of system models are used in various applications. This document will not cover them
explicitly, but many of the ideas and results from explicit function and state space models can be applied to other
model types.


One of these alternate model classes deserves special mention because of its wide use. This is the class
of auto-regressive moving average (ARMA) models and related variants (Hajdasinski, Eykhoff, Damen, and van den
Boom, 1982). Discrete-time ARMA models are in the general form

$$ z(t_i) + a_1 z(t_{i-1}) + \cdots + a_n z(t_{i-n}) = b_0 u(t_i) + b_1 u(t_{i-1}) + \cdots + b_m u(t_{i-m}) \tag{1.3-9} $$

Discrete-time ARMA models can be readily rewritten as linear state space models (Schweppe, 1973), so all of
the theory which we will develop for state space models is directly applicable.
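
One way to see the equivalence is to simulate an ARMA model both directly and through a state-space realization in the form of Equation (1.3-7). The Python sketch below is an added illustration: it uses one particular realization (an observer-type companion form with zero initial conditions, which is our choice and not necessarily the construction given by Schweppe), and the function names and coefficient values are hypothetical.

```python
# Illustrative sketch: an ARMA model in the form of Equation (1.3-9) and one
# equivalent linear state-space realization. Zero initial conditions assumed.
# Names, realization choice, and numbers are hypothetical.

def simulate_arma(a, b, u):
    """Direct recursion with a = [a_1..a_n], b = [b_0..b_m]."""
    z = []
    for i in range(len(u)):
        acc = sum(b[k] * u[i - k] for k in range(len(b)) if i - k >= 0)
        acc -= sum(a[k - 1] * z[i - k] for k in range(1, len(a) + 1) if i - k >= 0)
        z.append(acc)
    return z

def arma_to_state_space(a, b):
    """Return (Phi, Psi, C, D) with x(i+1) = Phi x + Psi u, z = C x + D u."""
    n = max(len(a), len(b) - 1)
    a_pad = list(a) + [0.0] * (n - len(a))
    b_pad = list(b) + [0.0] * (n + 1 - len(b))
    phi = [[0.0] * n for _ in range(n)]
    for k in range(n):
        phi[k][0] = -a_pad[k]          # first column holds -a_1 .. -a_n
        if k + 1 < n:
            phi[k][k + 1] = 1.0        # ones on the superdiagonal
    psi = [b_pad[k + 1] - a_pad[k] * b_pad[0] for k in range(n)]
    c = [1.0] + [0.0] * (n - 1)
    return phi, psi, c, b_pad[0]

def simulate_state_space(phi, psi, c, d, u):
    """Propagate the scalar-input, scalar-output state-space model."""
    n = len(psi)
    x = [0.0] * n
    z = []
    for u_i in u:
        z.append(sum(c_k * x_k for c_k, x_k in zip(c, x)) + d * u_i)
        x = [sum(phi[r][k] * x[k] for k in range(n)) + psi[r] * u_i
             for r in range(n)]
    return z

a = [-0.5, 0.06]
b = [1.0, 0.3]
u = [1.0, 0.0, 0.5, -1.0, 2.0, 0.0, 0.0, 1.0]
z_arma = simulate_arma(a, b, u)
z_ss = simulate_state_space(*arma_to_state_space(a, b), u)
print(max(abs(p - q) for p, q in zip(z_arma, z_ss)))
```

The state of this realization has no particular physical meaning, which is the usual price of canonical forms noted in Section 1.3.2.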


1.4  PARAMETER  ESTIMATION 

The examples in Section 1.2 were carefully chosen to have exact solutions. Real data is seldom so
obliging. No matter how careful we have been in selecting the form of the assumed system model, it will not
be an exact representation of the system. The experimental data will not be consistent with the assumed model
form for any value of the parameter vector $\xi$. The model may be close, but it will not be exact, if for no
other reason than that the measurements of the response will be made with real, and thus imperfect,
instruments.

The theoretical development seems to have arrived at a cul-de-sac. The black box system identification
problem was not feasible because there were too many solutions consistent with the data. To remove this
difficulty, it was necessary to assume a model form and define the problem as parameter identification. With the
assumed model, however, there are no solutions consistent with the data.

We need to retain the concept of an assumed model structure in order to reduce the scope of the problem,
yet avoid the inflexibility of requiring that the model exactly reproduce the experimental data. We do this
by using the assumed model structure, but acknowledging that it is imperfect. The assumed model structure
should include the essential characteristics of the true system. The selection of these essential
characteristics is the most significant engineering judgment in system analysis. A good example is Gauss' (1809,
p. xi) justification that the major axis of a cometary ellipse is not an essential parameter, and that a
simplified parabolic model is therefore appropriate:

There existed, in point of fact, no sufficient reason why it should be taken
for granted that the paths of comets are exactly parabolic: on the contrary,
it must be regarded as in the highest degree improbable that nature should
ever have favored such an hypothesis. Since, nevertheless, it was known, that
the phenomena of a heavenly body moving in an ellipse or hyperbola, the major
axis of which is very great relatively to the parameter, differs very little
near the perihelion from the motion in a parabola of which the vertex is at
the same distance from the focus; and that this difference becomes the more
inconsiderable the greater the ratio of the axis to the parameter: and since,
moreover, experience has shown that between the observed motion and the motion
computed in the parabolic orbit, there remained differences scarcely ever
greater than those which might safely be attributed to errors of observation
(errors quite considerable in most cases): astronomers have thought proper to
retain the parabola, and very properly, because there are no means whatever of
ascertaining satisfactorily what, if any, are the differences from a parabola.

Chapter 11 discusses some aspects of this selection, including theoretical aids to making such judgments.

Given the assumed model structure, the primary question is how to treat imperfections in the model.
We need to determine how to select the value of $\xi$ which makes the mathematical model the "best"
representation of the essential characteristics of the system. We also need to evaluate the error in the
determination of $\xi$ due to the unmodeled effects present in the experimental data. These needs introduce
several new concepts. One concept is that of a "best" representation as opposed to the correct representation.
It is often impossible to define a single correct representation, even in principle, because we have
acknowledged the assumed model structure to be imperfect and we have constrained ourselves to work within this
structure. Thus $\xi$ does not have a correct value. As Acton (1970) says on this subject,

A favorite form of lunacy among aeronautical engineers produces countless
attempts to decide what differential equation governs the action of some
physical object, such as a helicopter rotor.... But arguments about which
differential equation represents truth, together with their fitting
calculations, are wasted time.

Example 1.4-1 Estimating the radius of the Earth. The Earth is not a
perfect sphere and, thus, does not have a radius. Therefore, the problem of
estimating the radius of the Earth has no correct answer. Nonetheless, a
representation of the Earth as a sphere is a useful simplification for
many purposes.

Even the concept of the "best" representation overstates the meaning of our estimates because there is no
universal criterion for defining a single best representation (thus our quotes around "best"). Many system
identification methods establish an optimality criterion and use numerical optimization methods to compute the
optimal estimates as defined by the criterion; indeed most of this document is devoted to such optimal
estimators or approximations to them. To be avoided, however, is the common attitude that optimal (by some
criterion) is synonymous with correct, and that any nonoptimal estimator is therefore wrong. Klein (1975) uses
the term "adequate model" to suggest that the appropriate judgment on an identified model is whether the model
is adequate for its intended purpose.

In addition to these concepts of the correct, best, or adequate values of $\xi$, we have the somewhat related
issue of errors in the determination of $\xi$ caused by the presence of unmodeled effects in the experimental
data. Even if a correct value of $\xi$ is defined in principle, it may not be possible to determine this value
exactly from the experimental data due to contamination of the data by unmodeled effects.

We can now define the task as to determine the best estimate of $\xi$ obtainable from the data, or perhaps
an adequate estimate of $\xi$, rather than to determine the correct value of $\xi$. This revised problem is more
properly called parameter estimation than parameter identification. (Both terms are often used
interchangeably.) Implied subproblems of parameter estimation include the definition of the criteria for best or
adequate, and the characterization of potential errors in the estimates.

Example 1.4-2 Reconsider the problem of Example (1.2-5). Although there is
no linear model exactly consistent with the data, modeling the output as a
constant value of 1 appears a reasonable approximation and agrees exactly with
two of the three data points.

One approach to parameter estimation is to minimize the error between the model response and the actual
measured response, using a least squares or some similar ad hoc criterion. The values of the parameter
vector $\xi$ which result in the minimum error are called the best estimates. Gauss (1809, p. 162) introduced
this idea:


Finally, as all our observations, on account of the imperfection of the
instruments and of the senses, are only approximations to the truth, an
orbit based only on the six absolutely necessary data may still be liable to
considerable errors. In order to diminish these as much as possible, and
thus to reach the greatest precision attainable, no other method will be
given except to accumulate the greatest number of the most perfect
observations, and to adjust the elements, not so as to satisfy this or that set of
observations with absolute exactness, but so as to agree with all in the
best possible manner.

This approach is easy to understand without extensive mathematical background, and it can produce excellent
results. It is restricted to deterministic models so that the model response can be calculated.
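
As a small illustration of the model-fitting approach, the sketch below applies an ordinary least-squares criterion to the three data points of Example 1.2-5 and compares the result with the constant model Z = 1 suggested in Example 1.4-2. The code is an added illustration in plain Python; the function names `fit_line` and `sum_squared_error` are hypothetical, and least squares is only one of the possible criteria.

```python
# Illustrative sketch: least-squares fit of Z = a0 + a1*U to the data of
# Example 1.2-5. Function names are hypothetical.

def fit_line(u, z):
    """Least-squares estimates (a0, a1) for the model Z = a0 + a1*U."""
    n = len(u)
    u_mean = sum(u) / n
    z_mean = sum(z) / n
    s_uu = sum((u_i - u_mean) ** 2 for u_i in u)
    s_uz = sum((u_i - u_mean) * (z_i - z_mean) for u_i, z_i in zip(u, z))
    a1 = s_uz / s_uu
    a0 = z_mean - a1 * u_mean
    return a0, a1

def sum_squared_error(xi, u, z):
    """Sum of squared differences between measured and model response."""
    a0, a1 = xi
    return sum((z_i - (a0 + a1 * u_i)) ** 2 for u_i, z_i in zip(u, z))

u_data = [-1.0, 0.0, 1.0]
z_data = [1.0, 0.95, 1.0]

xi_ls = fit_line(u_data, z_data)
print(xi_ls, sum_squared_error(xi_ls, u_data, z_data))

# The constant model of Example 1.4-2 fits two points exactly but has a
# larger sum of squared errors than the least-squares estimate.
print(sum_squared_error((1.0, 0.0), u_data, z_data))
```

Both estimates are reasonable; which one is "best" depends on the criterion adopted, which is the point made above about the quotes around "best".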

An alternate approach to parameter estimation introduces probabilistic concepts in order to take
advantage of the extensive theory of statistical estimation. We should note that, from Gauss's time, these two
approaches have been intimately linked. The sentence immediately following the above exposition in Theoria
Motus (Gauss, 1809, p. 162) is

For which purpose, we will show in the third section how, according to the
principles of the calculus of probabilities, such an agreement may be
obtained, as will be, if in no one place perfect, yet in all places the
strictest possible.

In the statistical approach, all of the effects not included in the deterministic system model are modeled as
random noise; the characteristics of the noise and its position in the system equations vary for different
applications. The probabilistic treatment solves the perplexing problem of how to examine the effect of the
unmodeled portion of the system without first modeling it. The formerly unmodeled portion is modeled
probabilistically, which allows description of its general characteristics such as magnitude and frequency content,
without requiring a detailed model. Systems such as this, which involve both time and randomness, are referred
to as stochastic systems. This document will examine a small part of the extensive theory of stochastic
systems, which can be used to define estimates of the unknown parameters and to characterize the properties of
these estimates.
Although this document will devote significant time to the treatment of the probabilistic approach, this
approach should not be oversold. It is currently popular to disparage model-fitting approaches as nonrigorous
and without theoretical basis. Such attitudes ignore two important facts. First, in many of the most common
situations, the "sophisticated" probabilistic approach arrives at the same estimation algorithm as the
model-fitting approaches. This fact is often obscured by the use of buzz words and unenlightening notation,
apparently for fear that the theoretical effort will be considered as wasted. Our view is that such relationships
should be emphasized and clearly explained. The two approaches complement each other, and the engineer who
understands both is best equipped to handle real world problems. The model-fitting approach gives good
intuitive understanding of such problems as modeling error, algorithm convergence, and identifiability, among
others. The probabilistic approach contributes quantitative characterization of the properties of the
estimates (the accuracy), and an understanding of how these characteristics are affected by various factors.

The second fact ignored by those who disparage model fitting is that the probabilistic approach involves
just as many (or more) unjustified ad hoc assumptions. Behind the smug front of mathematical rigor and
sophistication lie patently ridiculous assumptions about the system. The contaminating noise seldom has any of the
characteristics (Gaussian, white, etc.) assumed simply in order to get results in a usable form. More basic is
the fact that the contaminating noise is not necessarily random noise at all. It is a composite of all of the
otherwise unmodeled portions of the system output, some of which might be "truly" random (deferring the
philosophical question of whether truly random events exist), but some of which are certainly deterministic
even at the macroscopic level. In light of this consideration, the "rigor" of the probabilistic approach is
tarnished from the start, no matter how precise the inner mathematics. Contrary to the impressions often
given, the probabilistic approach is not the single correct answer, but is one of the possible avenues that can
give useful results, making on the average as many unjustified or blatantly false assumptions as the
alternatives. Bayes (1736, p. 9), in an essay reprinted by Barnard (1958), made a classical statement on the role of
assumptions in mathematics:

It is not the business of the Mathematician to dispute whether quantities do
in fact ever vary in the manner that is supposed, but only whether the notion
of their doing so be intelligible; which being allowed, he has a right to take
it for granted, and then see what deductions he can make from that
supposition.... He is not inquiring how things are in matter of fact, but supposing
things to be in a certain way, what are the consequences to be deduced from
them; and all that is to be demanded of him is, that his suppositions be
intelligible, and his inferences just from the suppositions he makes.

The demands on the applications engineer are somewhat different, and more in line with Bayes' (1736, p. 50)
later statement in the same document:

So far as Mathematics do not tend to make men more sober and rational thinkers,
wiser and better men, they are only to be considered as an amusement, which
ought not to take us off from serious business.

A few words are necessary in defense of the probabilistic approach, lest the reader decide that it is not
worthwhile to pursue. The main issue is the description of deterministic phenomena as random. This disagrees
with common modern perceptions of the meaning and use of randomness for physical situations, in which random
and deterministic phenomena are considered as quite distinct and well delineated. Our viewpoint owes more to
the earlier philosophy of probability theory: that it is a useful tool for studying complicated phenomena
which need not be inherently random (if anything is inherently random). Cramer (1946, p. 141) gives a classic
exposition of this philosophy:

[The following is descriptive of]...large and important groups of random
experiments. Small variations in the initial state of the observed units,
which cannot be detected by our instruments, may produce considerable changes
in the final result. The complicated character of the laws of the observed
phenomena may render exact calculation practically, if not theoretically,
impossible. Uncontrollable action by small disturbing factors may lead to
irregular deviations from a presumed "true value".

It is, of course, clear that there is no sharp distinction between these
various modes of randomness. Whether we ascribe e.g. the fluctuations observed
in the results of a series of shots at a target mainly to small variations in
the initial state of the projectile, to the complicated nature of the ballistic
laws, or to the action of small disturbing factors, is largely a matter of
taste. The essential thing is that, in all cases where one or more of these
circumstances are present, an exact prediction of the results of individual
experiments becomes impossible, and the irregular fluctuations characteristic
of random experiments will appear.

We shall now see that, in cases of this character, there appears amidst
all irregularity of fluctuations a certain typical form of regularity that
will serve as the basis of the mathematical theory of statistics.

The probabilistic methods allow quantitative analysis of the general behavior of these complicated phenomena,
even though we are unable to model the exact behavior.

1.5 OTHER APPROACHES

Our aim in this document is to present a unified viewpoint of the system identification ideas leading
to maximum-likelihood estimation of the parameters of dynamic systems, and of the application of these ideas.
There are many completely different approaches to identification of dynamic systems.

There are innumerable books and papers in the system identification literature. Eykhoff (1974) and
Astrom and Eykhoff (1970) give surveys of the field. However, much of the work in system identification is
published outside of the general body of system identification literature. Many techniques have been
developed for specific areas of application by researchers oriented more toward the application area than toward
general system identification problems. These specialized techniques are part of the larger field of system
identification, although they are usually not labeled as such. (Sometimes they are recognizable as special
cases or applications of more general results.) In the area most familiar to us, aircraft stability and
control derivatives were estimated from flight data long before such estimation was classified as a system
identification problem (Doetsch, 1953; Etkin, 1958; Flack, 1959; Greenberg, 1951; Rampy and Berry, 1964;
Wolowicz, 1966; and Wolowicz and Holleman, 1958).

We do not even attempt here the monumental task of surveying the large body of system identification
techniques. Suffice it to say that other approaches exist, some explicitly labeled as system identification
techniques, and some not so labeled. We feel that we are better equipped to make a useful contribution by
presenting, in an organized and comprehensible manner, the viewpoint with which we are most familiar. This
orientation does not constitute a dismissal of other viewpoints.

We have sometimes been asked to refute claims that, in some specific application, a simple technique such
as regression obtained superior results to a "sophisticated" technique bearing impressive-sounding credentials
as an optimal nonlinear maximum likelihood estimator. The implication is that simple is somehow synonymous
with poor, and sophisticated is synonymous with good, associations that we completely disavow. Indeed, the
opposite association seems more often appropriate, and we try to present the maximum likelihood estimator in
a simple light. We believe that these methods are all tools to be used when they help do the job. We have
used quotations from Gauss several times in this chapter to illustrate his insight into what are still some of
the important issues of the day, and we will close the chapter with yet another (Gauss, 1809, p. 108):

...we hope, therefore, it will not be disagreeable to the reader, that, besides
the solution to be given hereafter, which seems to leave nothing further to be
desired, we have thought proper to preserve also the one of which we have made
frequent use before the former suggested itself to me. It is always profitable
to approach the more difficult problems in several ways, and not to despise the
good although preferring the better.

CHAPTER 2


2.0 OPTIMIZATION METHODS

Most of the estimators in this book require the minimization or maximization of a nonlinear function.
Sometimes we can write an explicit expression for the minimum or maximum point. In many cases, however, we
must use an iterative numerical algorithm to find the solution. Therefore a background in optimization methods
is mandatory for appreciation of the various estimators.

Optimization is a major field in its own right and we do not attempt a thorough treatment or even a survey
of the field in this chapter. Our purpose is to briefly introduce a few of the optimization techniques most
pertinent to parameter estimation. Several of the conclusions we draw about the relative merits of various
algorithms are influenced by the general structure of parameter estimation problems and, thus, might not be
supportable in a broader context of optimizing arbitrary functions. Numerous books such as Rao (1979),
Luenberger (1969), Luenberger (1972), Dixon (1972), and Polak (1971) cover the detailed derivation and analysis
of the techniques discussed here and others. These books give more thorough treatments of the optimization
methods than we have room for here, but are not oriented specifically to parameter estimation problems. For
those involved in the application of estimation theory, and particularly for those who will be writing computer
programs for parameter estimation, we strongly recommend reading several of these books. The utility and
efficiency of a parameter estimation program depend strongly on its optimization algorithms. The material in this
chapter should be sufficient for a general understanding of the problems and the kinds of algorithms used, but
not for the details of efficient application.


The basic optimization problem is to find the value of the vector x that gives the smallest or largest
value of the scalar-valued function J(x). By convention we will talk about minimization problems; any
maximization problem can be made into an equivalent minimization problem by changing the sign of the function. We
will follow the widespread practice of calling the function to be minimized a cost function, regardless of
whether or not it really has anything to do with monetary cost. To formalize the definition of the problem,
a function J(x) is said to have a minimum at $\hat{x}$ if

$J(\hat{x}) \le J(x)$  (2.0-1)

for all x. This is sometimes called an unconstrained global minimum to distinguish it from local and
constrained minima, which are defined below.

Two kinds of side constraints are sometimes placed on the problem. Equality constraints are in the form

$g_i(x) = 0$  (2.0-2)

Inequality constraints are in the form

$h_i(x) \le 0$  (2.0-3)

The $g_i$ and $h_i$ are scalar-valued functions of x. There can be any number of constraints on a problem. A
value of x is called admissible if it satisfies all of the constraints; if a value violates any of the
constraints it is inadmissible. The constraints modify the problem statement as follows: $\hat{x}$ is the constrained
minimum of J(x) if $\hat{x}$ is admissible and if Equation (2.0-1) holds for all admissible x.

Two crucial questions about any optimization problem are whether a solution exists and whether it is
unique. These questions are important in application as well as in theory. A computer program can spend a
long time searching for a solution that does not exist. A simple example of an optimization problem with no
solution is the unconstrained minimization of J(x) = x. A problem can also fail to have a solution because
there is no x satisfying the constraints. We will say that a problem that has no solution is ill-posed.

A simple problem with a nonunique solution is the unconstrained minimization of $J(x) = (x_1 - x_2)^2$, where x
is a 2-vector.

All of the algorithms that we discuss (and most other algorithms) search for a local minimum of the
function, rather than the global minimum. A local minimum (also called a relative minimum) is defined as follows:
$\hat{x}$ is a local minimum of J(x) if a scalar $\epsilon > 0$ exists such that

$J(\hat{x}) \le J(\hat{x} + h)$  (2.0-4)

for all h with $|h| < \epsilon$. To define a constrained local minimum, we must add the qualifications that $\hat{x}$
and $\hat{x} + h$ satisfy the constraints. The term "extremum" refers to either a local minimum or a local maximum.
Figure (2.0-1) illustrates a problem with three local minima, one of which is the global minimum.

Note that if a global minimum exists, even if it is not unique, it is also a local minimum. The converse
to this statement is false; the existence of a local minimum does not even imply that a global minimum exists.

We can sometimes prove that a function has only one local minimum point, and that this point is also the
global minimum. When we lack such proofs, there is no universal way to guarantee that the local minimum found
by an algorithm is the global minimum. A reasonable check for iterative algorithms is to try the algorithm
with many different starting values widely distributed within the realm of possible values. If the algorithm
consistently converges to the same point, that point is probably the global minimum. The cost of such
a test, however, is often prohibitively high.
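
The multiple-start check can be sketched in a few lines of Python. The cost function, the crude fixed-step descent used as the local minimizer, and the names `cost`, `local_descent`, and `multi_start` are all illustrative assumptions; any local minimization algorithm could take the place of `local_descent`.

```python
import random

def cost(x):
    # Illustrative scalar cost with two local minima (an assumption, not from the text).
    return x**4 - 3.0 * x**2 + x

def cost_derivative(x):
    return 4.0 * x**3 - 6.0 * x + 1.0

def local_descent(x, step=0.01, iterations=5000):
    # Crude fixed-step gradient descent; it stands in for any local minimizer.
    for _ in range(iterations):
        x -= step * cost_derivative(x)
    return x

def multi_start(n_starts=20, low=-2.0, high=2.0, seed=0):
    # Run the local minimizer from widely distributed starting values.
    rng = random.Random(seed)
    results = [local_descent(rng.uniform(low, high)) for _ in range(n_starts)]
    # Round the end points to see how many distinct local minima were reached.
    distinct = sorted({round(r, 3) for r in results})
    best = min(results, key=cost)
    return distinct, best

distinct, best = multi_start()
print("distinct end points:", distinct)
print("lowest-cost end point:", best)
```

In this sketch the starts do not all reach the same point, so the check reveals more than one local minimum; the lowest-cost end point is only a candidate for the global minimum, not a proof.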

The likelihood of local minima difficulties varies widely depending on the application. In some
applications we can prove that there are no local minima except at the unique global minimum. At the other extreme,
some applications are plagued by numerous local minima to the extent that most minimization algorithms are
worthless. Most applications lie between these extremes. We can often argue convincingly that a particular
answer must be the global minimum, even when rigorous proof is impractical.

The algorithms in this chapter are, with a few exceptions, iterative. Given some starting value $x_0$, the
algorithms give a procedure for computing a new value $x_1$; then $x_2$ is computed from $x_1$, etc. The intent of
the iterative algorithms is to create a sequence $x_i$ that converges to the minimum. The starting value can
be from an independent estimate of a reasonable answer, or it can come from a special start-up algorithm. The
final step of any iterative algorithm is testing convergence. After the algorithm has proceeded for some time,
we need to choose among the following alternatives: 1) the algorithm has converged to a value sufficiently
close to the true minimum and should therefore be terminated; 2) the algorithm is making acceptable progress
toward the solution and should be continued; 3) the algorithm is failing to converge or is converging too
slowly to obtain a solution in an acceptable time, and it should therefore be abandoned; or 4) the algorithm
is exhibiting behavior that suggests that switching to a different algorithm (or modifying the current one)
might be productive. This decision is far from trivial because some algorithms can essentially stall at a
point far from any local minimum, making such small changes in $x_i$ that they appear to have converged.

We have briefly mentioned the problems of existence and uniqueness of solutions, local minima, starting
values, and convergence tests. These are major issues in practical application, but we will not examine them
further here. The references contain considerable discussion of these issues.

A cost function of an N-dimensional x vector can be visualized as a hypersurface in (N + 1)-dimensional
space. For illustrating the behavior of the various algorithms, we will use isocline plots of cost functions
of two variables. An isocline is the locus of all points in the x-space corresponding to some specified cost
function value. The isoclines of positive definite quadratic functions are always ellipses. Furthermore, a
quadratic function is completely specified by one of its isoclines and the fact that it is quadratic.
Two-dimensional examples are sufficient to illustrate most of the pertinent points of the algorithms.

We will consider unconstrained minimization problems, which illustrate the basic points necessary for our
purposes. The references address problems with equality and inequality constraints.


2.1  ONE-DIMENSIONAL  SEARCHES 

Optimization methodology is strongly influenced by whether or not x is a scalar. Because the
optimization problems in this book are generally multi-dimensional, the methods applicable only to scalar x are not
directly relevant.

Many of the multi-dimensional optimization algorithms, however, require the solution of one-dimensional
subproblems as part of the larger algorithm. Most such subproblems are in the form of minimizing the
multi-dimensional cost function with x constrained to a line in the multi-dimensional space. This has the
superficial appearance of a multi-dimensional problem, and furthermore one with the added complications of
constraints. To clarify the one-dimensional nature of these subproblems, express them as follows: the vector x
is restricted to a line defined by

$x = x_0 + \lambda x_1$  (2.1-1)

where $x_0$ and $x_1$ are fixed vectors, and $\lambda$ is a scalar variable representing position along the line.
Restricted to this line, the cost can be written as a function of $\lambda$.

$g(\lambda) = J(x_0 + \lambda x_1)$  (2.1-2)

The function $g(\lambda)$ is a scalar function of a scalar variable, and one-dimensional minimization algorithms apply
directly. Substituting the minimizing value of $\lambda$ into Equation (2.1-1) then gives the minimizing point along
the line in the space of x.
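
As a minimal sketch of Equations (2.1-1) and (2.1-2), the following Python fragment wraps a multi-dimensional cost function so that it becomes a scalar function of the position along a line. The helper name `restrict_to_line` and the quadratic cost `J` are assumptions chosen only for illustration.

```python
def restrict_to_line(J, x0, x1):
    """Return g(lam) = J(x0 + lam * x1) for fixed vectors x0 and x1."""
    def g(lam):
        point = [a + lam * b for a, b in zip(x0, x1)]
        return J(point)
    return g

def J(x):
    # Illustrative quadratic cost of a 2-vector (an assumption, not from the text).
    return (x[0] - 1.0) ** 2 + 4.0 * (x[1] + 2.0) ** 2

# Restrict J to the line through the origin with direction (1, -1).
g = restrict_to_line(J, x0=[0.0, 0.0], x1=[1.0, -1.0])

# Setting dg/dlam = 10*lam - 18 to zero gives the minimizing value lam = 1.8.
print(g(1.7), g(1.8), g(1.9))
```

Any one-dimensional minimizer can then be applied to `g` without knowing that it came from a multi-dimensional problem.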

We will not examine the one-dimensional search algorithms in this book. Several of the references have
good treatments of the subject. We will note that most of the relevant one-dimensional algorithms involve
approximating the function by a low-order polynomial based on the values of the function and its first and
second derivatives at one or more points. The minimum point of the polynomial, explicitly evaluated, replaces
one of the original points, and the process repeats. The distinguishing features of the algorithms are the
order of the polynomial, the number of points, and the order of the derivatives of J(x) evaluated. Variants
of the algorithms depend on start-up procedures and methods for selecting the point to be replaced.
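
One member of this family, shown here only as a hedged illustration, fits a second-order polynomial through three function values (no derivatives) and evaluates the vertex of the parabola explicitly. The function name `parabolic_step` is an assumption, and the start-up procedure and the rule for choosing which point to replace are deliberately left out.

```python
def parabolic_step(a, b, c, fa, fb, fc):
    """Abscissa of the vertex of the parabola through (a, fa), (b, fb), (c, fc)."""
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    if den == 0.0:
        raise ValueError("the three points are collinear; no unique vertex")
    return b - 0.5 * num / den

def g(lam):
    # Illustrative scalar function with its minimum at lam = 2 (an assumption).
    return (lam - 2.0) ** 2 + 1.0

# For an exactly quadratic g, one step lands on the minimum.
print(parabolic_step(0.0, 1.0, 4.0, g(0.0), g(1.0), g(4.0)))
```

The vertex is a minimum only if the fitted parabola opens upward, so a practical search would check that condition before accepting the new point.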

In some special cases we can solve the one-dimensional minimization problems explicitly by setting the
derivative to zero, or by other means, even when we cannot explicitly solve the encompassing multi-dimensional
problem. Several of our examples of multi-dimensional algorithms will use explicit solutions of the
one-dimensional subproblems to avoid getting bogged down in detail. Real problems seldom will be so conveniently
amenable to exact solution of the one-dimensional subproblems, except where the multi-dimensional problem could
be directly solved without resort to iterative methods. Iterative one-dimensional searches are usually
necessary with any method that involves one-dimensional subproblems. We will encounter one of the rare exceptions
in the estimation of variance.


2.2  DIRECT  METHODS 

Optimization  methods  that  do  not  require  the  evaluation  of  derivatives  of  the  cost  function  are  called 
direct  methods  or  zero-order  methods  (because  they  use  up  to  zeroth  order  derivatives).  These  methods  use 
only  the  cost  function  values. 

Axial iteration, also called the univariate method or coordinate descent, is the basis for many of the
direct methods. In this method we search along each of the coordinate directions of the x-space, one at a
time. Starting with the point $x_0$, fix the values of all but the first coordinate, reducing the problem to
one-dimensional minimization. Solve this problem using any one-dimensional algorithm. Call the resulting
point $x_1$. Then fix the first coordinate at the value so determined and do a similar search along the
direction of the second coordinate, giving the point $x_2$. Continue these one-dimensional searches until each of the
N coordinate directions has been searched; the final point of this process is $x_N$.

The point $x_N$ completes the first cycle of minimization. Repeat this cycle starting from the point $x_N$
instead of $x_0$. Continue repeating the minimization cycle until the process converges (or until you give up,
which may well come first).
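
The cycle described above can be sketched as follows. The golden-section routine is one possible choice of one-dimensional algorithm, not one prescribed by the text, and the names `golden_section` and `axial_iteration`, the search half-width, and the stopping tolerance are illustrative assumptions.

```python
import math

def golden_section(g, lo, hi, tol=1e-10):
    """Minimize a unimodal scalar function g on the interval [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - ratio * (b - a)
    d = a + ratio * (b - a)
    while b - a > tol:
        if g(c) < g(d):
            b, d = d, c                  # keep [a, d]; old c becomes new d
            c = b - ratio * (b - a)
        else:
            a, c = c, d                  # keep [c, b]; old d becomes new c
            d = a + ratio * (b - a)
    return 0.5 * (a + b)

def axial_iteration(J, x_start, half_width=10.0, cycles=50, tol=1e-8):
    """Search along each coordinate direction in turn, one cycle at a time."""
    x = list(x_start)
    for _ in range(cycles):
        x_prev = list(x)
        for k in range(len(x)):
            def g(value, k=k):
                trial = list(x)          # all other coordinates stay fixed
                trial[k] = value
                return J(trial)
            x[k] = golden_section(g, x[k] - half_width, x[k] + half_width)
        if max(abs(a - b) for a, b in zip(x, x_prev)) < tol:
            break                        # a full cycle produced no real change
    return x

# Illustrative cost with its minimum at (1, 1) (an assumption).
print(axial_iteration(lambda x: 3.0 * (x[0] - x[1]) ** 2 + (x[0] + x[1] - 2.0) ** 2,
                      [5.0, -3.0]))
```

The stopping rule used here, a small change over a whole cycle, is exactly the kind of test that can be fooled by an algorithm that has stalled, so it should be read as a placeholder.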

The performance of the axial iteration algorithm on most problems is unacceptably poor. The algorithm
performs well only when the minimum point along each axis is nearly independent of the values of the other
coordinates.


Example 2.2-1  Use axial iteration to minimize $J(x,y) = A(x - y)^2 + B(x + y)^2$
with $A \gg B$. The solution is the trivially obvious (0,0), but the problem
is good for illustrating the behavior of algorithms in a simple case. Instead
of using a one-dimensional search procedure, we will explicitly solve the
one-dimensional subproblems. For any fixed y, obtain the minimizing x
coordinate value by setting the derivative to zero

$\frac{\partial}{\partial x} J(x,y) = 2A(x - y) + 2B(x + y) = 0$

giving

$x = \frac{A - B}{A + B} \, y$

Similarly, for fixed x, the minimizing y value is

$y = \frac{A - B}{A + B} \, x$

We see that for $A \gg B$, the values of x and y descend slowly toward the true minimum at (0,0).

Figure (2.2-1) illustrates this behavior on an isocline plot. Note that if A = B (the cost function isocline
is circular) the exact minimum is obtained in one cycle, but as A/B increases the performance worsens.
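
A short numerical sketch of Example 2.2-1 makes the slow descent concrete. It applies the two explicit one-dimensional solutions in turn; the function name `axial_cycles` and the choice A = 100, B = 1 are assumptions for illustration.

```python
def axial_cycles(A, B, x, y, n_cycles):
    """Apply the explicit axial-iteration updates of Example 2.2-1."""
    r = (A - B) / (A + B)
    history = [(x, y)]
    for _ in range(n_cycles):
        x = r * y          # minimize over x with y fixed
        y = r * x          # minimize over y with the new x fixed
        history.append((x, y))
    return history

# With A = 100 and B = 1 each cycle multiplies y by (99/101)**2, about 0.96.
for cycle, (x, y) in enumerate(axial_cycles(100.0, 1.0, 1.0, 1.0, 10)):
    print(cycle, x, y)

# With A = B the factor is zero and one cycle reaches the minimum.
print(axial_cycles(1.0, 1.0, 1.0, 1.0, 1)[-1])
```

Each cycle reduces both coordinates by the factor $((A - B)/(A + B))^2$, which approaches 1 as A/B grows.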

Several modifications to the basic axial iteration method are available to improve its performance. Some
of these modifications exploit the notion of the pattern direction, the direction from the beginning point
$x_{iN}$ of a cycle to the end point $x_{(i+1)N}$ of the same cycle. Figure (2.2-2) illustrates the pattern
direction, which tends to point in the general direction of the minimum. Powell's method is the most powerful of
the direct methods that search along pattern directions. See the references for details.


2.3  GRADIENT  METHODS 

Optimization methods that use the first derivative (gradient) of the cost function are called gradient
methods or first order methods. Gradient methods require that the cost function be differentiable; most of the
cost functions considered in this book meet this requirement. The gradient methods generally converge in fewer
iterations than many of the direct methods because the gradient methods use more information in each iteration.
(There are exceptions, particularly when comparing simple-minded gradient methods with the most powerful of the
direct methods.) The penalty paid for the generally improved performance of the gradient methods compared with
the direct methods is the requirement to evaluate the gradient.

We define the gradient of the function J(x) with respect to x to be the row vector. (Some texts define
it as a column vector; the difference is inconsequential as long as one is consistent.)

$\nabla_x J(x) \equiv \left[ \frac{\partial}{\partial x_1} \;\; \frac{\partial}{\partial x_2} \;\; \cdots \;\; \frac{\partial}{\partial x_N} \right] J(x)$  (2.3-1)

A reasonable estimate of the computational cost of evaluating the gradient is N times the cost of evaluating
the function. This estimate follows from the fact that the gradient can be approximately evaluated by N
finite differences

$\frac{\partial J(x)}{\partial x_i} \approx \frac{J(x + \epsilon e_i) - J(x)}{\epsilon}$  (2.3-2)

where $e_i$ is the unit vector along the $x_i$ axis and $\epsilon$ is a small number. In special cases, there can be
expressions for the gradient that cost significantly less than N function evaluations.
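
The finite-difference estimate of Equation (2.3-2) can be sketched directly. The function name `finite_difference_gradient`, the default step size, and the test cost are assumptions; a suitable step depends on the scaling of the problem and on the arithmetic precision.

```python
def finite_difference_gradient(J, x, eps=1e-6):
    """Approximate the gradient of J at x with N forward differences."""
    base = J(x)                          # one evaluation shared by all components
    gradient = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += eps                # x + eps * e_i
        gradient.append((J(shifted) - base) / eps)
    return gradient

def J(x):
    # Illustrative cost (an assumption): its exact gradient is (0.01*x1, x2).
    return 0.5 * (0.01 * x[0] ** 2 + x[1] ** 2)

print(finite_difference_gradient(J, [10.0, 1.0]))   # close to [0.1, 1.0]
```

The loop makes N extra function evaluations, which is the source of the cost estimate quoted above.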

Equation (2.3-2) somewhat obscures the distinction between the gradient methods and the direct methods.
We can rewrite any gradient method in a finite difference form that does not explicitly involve gradients.
There is, nonetheless, a fairly clear distinction between methods derived from gradient ideas and methods
derived from direct search ideas. We will retain this philosophical distinction regardless of whether the
gradients are evaluated explicitly or by finite differences.

The method of steepest descent (also called the gradient method) involves a series of one-dimensional
searches, as did the axial-iteration method and its variants. In the steepest-descent method, these searches
are along the direction of the negative of the gradient vector, evaluated at the current point. The
one-dimensional problem is to find the value of $\lambda$ that minimizes

$g_i(\lambda) \equiv J(x_i + \lambda s_i)$  (2.3-3)

where $s_i$ is the search direction given by

$s_i = -\nabla_x^* J(x_i)$  (2.3-4)

The negative of the gradient is the direction of steepest local descent of the cost function (thus the
name of the method). To prove this property, first note that for any vector s we have

$\frac{d}{d\lambda} J(x + \lambda s) \Big|_{\lambda = 0} = \langle s, \nabla_x^* J(x) \rangle$  (2.3-5)

We are using the $\langle \cdot , \cdot \rangle$ notation for the inner product

$\langle x, y \rangle \equiv x^* y$  (2.3-6)

Equation (2.3-5) is a generalization of the definition of the gradient; it applies in spaces where
Equation (2.3-1) is not meaningful. We then need only show that, if s is restricted to be a unit vector,
Equation (2.3-5) is minimized by choosing s in the direction of $-\nabla_x^* J(x)$. This follows immediately from the
Cauchy-Schwarz inequality (Luenberger, 1969) of linear algebra.

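Theorem 2.3-1 (Cauchy-Schwarz)  $\langle x, y \rangle^2 \le |x|^2 |y|^2$ with equality if and only if $x = ay$ for
some scalar a.

Proof  The theorem is trivial if y = 0. For $y \ne 0$ examine

$\langle x + \lambda y, x + \lambda y \rangle = \langle x, x \rangle + \lambda^2 \langle y, y \rangle + 2\lambda \langle x, y \rangle \ge 0$  (2.3-7)

Choose

$\lambda = -\langle x, y \rangle / \langle y, y \rangle$  (2.3-8)

Substitute into Equation (2.3-7) and rearrange to give

$\langle x, y \rangle^2 \le \langle x, x \rangle \langle y, y \rangle = |x|^2 |y|^2$  (2.3-9)

Equality holds if and only if $x + \lambda y = 0$ in Equation (2.3-7), which will be
true if and only if $x = ay$ ($\lambda$ will then be $-a$).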

On the surface, the steepest descent property of the method seems to imply excellent performance in
minimizing the cost function value. The direction of steepest descent, however, is a local property which might
point far from the direction of the global minimum. It is thus often a poor choice of search direction.

Direct methods such as Powell's often converge more rapidly than steepest descent.

The steepest descent method performs worst in long narrow valleys of the cost function. It is also
sensitive to scaling. These two difficulties are closely related; rescaling a problem can easily create long
narrow valleys. The following examples illustrate the scaling and valley difficulties:

Example 2.3-1  Let the cost function be

$J(x) = \frac{1}{2} (x_1^2 + x_2^2)$

The steepest descent method works excellently for this cost function (so does
almost every optimization method). The gradient of J(x) is

$\nabla_x J(x) = (x_1, x_2) = x^*$

Therefore, from any starting point, the negative of the gradient points
exactly at the origin, which is the global minimum. The minimum will be
attained exactly (or to the accuracy of the one-dimensional search methods
used) in one iteration. Figure (2.3-1) illustrates the algorithm starting
from the point (1,1)*.

Example 2.3-2  Rescale the preceding example by replacing $x_1$ by $0.1 x_1$.
(Perhaps we just redefined the units of $x_1$ to be millimeters instead of
centimeters.) The cost function is then

$J(x) = \frac{1}{2} (0.01 x_1^2 + x_2^2)$

and the gradient is

$\nabla_x J(x) = (0.01 x_1, x_2)$

Figure (2.3-2) shows the search direction used by the algorithm starting from
the point (10,1)*, which corresponds to the point (1,1)* in the previous
example. The search direction points almost 90° from the origin. A careless
glance at Figure (2.3-2) invites the conclusion that the minimum in the
search direction will be on the $x_1$ axis and thus that the second iteration
of the steepest descent algorithm will attain the minimum. It is true that
the minimum is close to the $x_1$ axis, but it is not exactly on the axis; the
distinction makes an important difference in the algorithm's performance.

For points $x - \lambda \nabla_x^* J(x)$ along the search direction from any point
$(x_1, x_2)^*$, the cost function is

$g(\lambda) = J(x - \lambda \nabla_x^* J(x)) = \frac{1}{2} [0.01 x_1^2 (1 - 0.01\lambda)^2 + x_2^2 (1 - \lambda)^2]$

The minimum of $g(\lambda)$ is at

$\lambda = \frac{(0.01)^2 x_1^2 + x_2^2}{(0.01)^3 x_1^2 + x_2^2}$

and thus the minimum point along the search direction is

$(x_1 - 0.01 x_1 \lambda, \; x_2 - x_2 \lambda)^*$

with $\lambda$ defined as above. The following table and Figure (2.3-3) show
several iterations of this process starting from the point (10,1)*.


Iteration    x_1       x_2

    0        10        1
    1        9.899     -.009899
    2        4.900     .4900
    3        4.851     -.004851
    4        2.401     .2401
    5        2.377     -.002377
    6        1.176     .1176
    7        1.165     -.001165

The trend of the algorithm is clear; every two iterations it moves essentially
halfway to the solution. Consider the behavior starting from the point
(10,0.1)* instead of (10,1)*:


Iteration    x_1       x_2

    0        10        0.1
    1        9.802     -.09802
    2        9.608     .09608
    3        9.418     -.09418
    4        9.231     .09231
    5        9.048     -.09048
    6        8.869     .08869
    7        8.694     -.08694

This behavior, plotted in Figure (2.3-4), is abysmal. The algorithm is
bouncing back and forth across the valley, making little progress toward the
minimum.
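
Both tables of Example 2.3-2 can be regenerated with a few lines of Python that apply the explicit line-search solution of the example. The function name `steepest_descent_path` is an assumption; the formulas are those given in the example.

```python
def steepest_descent_path(x1, x2, n_iterations):
    """Steepest descent with exact line search on J = 0.5*(0.01*x1**2 + x2**2)."""
    path = [(x1, x2)]
    for _ in range(n_iterations):
        # Minimizing step length along the negative gradient (0.01*x1, x2).
        lam = (0.01**2 * x1**2 + x2**2) / (0.01**3 * x1**2 + x2**2)
        x1, x2 = x1 - 0.01 * x1 * lam, x2 - x2 * lam
        path.append((x1, x2))
    return path

for start in [(10.0, 1.0), (10.0, 0.1)]:
    print("start", start)
    for iteration, (x1, x2) in enumerate(steepest_descent_path(*start, 7)):
        print(iteration, round(x1, 3), round(x2, 6))
```

From (10,1)* the iterates shrink by a factor of about 0.49 every two iterations, while from (10,0.1)* the factor per iteration is about 0.98, which is the bouncing behavior described above.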

Several modifications to the steepest descent method are available to improve its performance. A rescaling
step to eliminate valleys caused by scaling yields major improvements for some problems. The method of
parallel tangents (PARTAN method) exploits pattern directions similar to those discussed in Section 2.2; searches in
such pattern directions are often called acceleration steps. The conjugate gradient method is the most
powerful of the modifications to steepest descent. The references discuss these and other gradient algorithms in
detail.


2.4  SECOND  ORDER  METHODS 

Optimization methods that use the second derivative (or an approximation to it) of the cost function are
called second order methods. These methods require that the first and second derivatives of the cost function
exist.


2.4.1 Newton-Raphson

The Newton-Raphson optimization algorithm (also called Newton's method) is the basis for all of the second
order methods. The idea of this algorithm is to approximate the cost function by the first three terms of its
Taylor series expansion about the current point.

$J_i(x) \equiv J(x_i) + (x - x_i)^* \nabla_x^* J(x_i) + \frac{1}{2} (x - x_i)^* [\nabla_x^2 J(x_i)] (x - x_i)$  (2.4-1)

From a geometric viewpoint, this equation describes the paraboloid that best approximates the function near
$x_i$. Equating the gradient of $J_i(x)$ to zero gives an equation for the minimum point of the approximating
function. Taking this gradient, note that $\nabla_x J(x_i)$ and $\nabla_x^2 J(x_i)$ are evaluated at the fixed point $x_i$ and
thus are not functions of x.

$\nabla_x J_i(x) = \nabla_x J(x_i) + (x - x_i)^* [\nabla_x^2 J(x_i)] = 0$  (2.4-2)

The solution is

$x = x_i - [\nabla_x^2 J(x_i)]^{-1} \nabla_x^* J(x_i)$  (2.4-3)

If the second gradient of J is positive definite, then Equation (2.4-3) gives the exact unique minimum of
the approximating function; it is a reasonable guess at an approximate minimum of the original function. If
the second gradient is not positive definite, then the approximating function does not have a unique minimum
and the algorithm is likely to perform poorly. The Newton-Raphson algorithm uses Equation (2.4-3) iteratively;
the x from this equation is the starting point for the next iteration. The algorithm is

$x_{i+1} = x_i - [\nabla_x^2 J(x_i)]^{-1} \nabla_x^* J(x_i)$  (2.4-4)

The performance of this algorithm in the close neighborhood of a strict local minimum is unexcelled; this
performance represents an ideal toward which other algorithms strive. The Newton-Raphson algorithm attains
the exact (except for numerical round-off errors) minimum of any positive-definite quadratic function in a
single iteration. Convergence within 5 to 10 iterations is common on some practical nonquadratic problems
with several dozen dimensions; direct and gradient methods typically count iterations in hundreds and thousands
for such problems and settle for less accurate answers. See the references for analysis of convergence
characteristics.
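
A minimal sketch of the iteration in Equation (2.4-4), restricted to two variables so that the linear system can be solved in closed form, is given below. The function names, the use of Cramer's rule, and the stopping tolerance are assumptions made for illustration; there is no start-up procedure and no protection against a second gradient that is not positive definite.

```python
def newton_raphson(gradient, hessian, x, iterations=20, tol=1e-12):
    """Newton-Raphson iteration of Equation (2.4-4) for a cost of two variables."""
    for _ in range(iterations):
        g1, g2 = gradient(x)
        (a, b), (c, d) = hessian(x)
        det = a * d - b * c
        if det == 0.0:
            raise ZeroDivisionError("second gradient is singular")
        # Solve [second gradient] * step = gradient by Cramer's rule (2-by-2 only).
        step1 = (d * g1 - b * g2) / det
        step2 = (a * g2 - c * g1) / det
        x = (x[0] - step1, x[1] - step2)
        if abs(step1) + abs(step2) < tol:
            break
    return x

# The badly scaled quadratic of Example 2.3-2: J = 0.5*(0.01*x1**2 + x2**2).
def quad_gradient(x):
    return (0.01 * x[0], x[1])

def quad_hessian(x):
    return ((0.01, 0.0), (0.0, 1.0))

# One iteration from (10, 1) reaches the minimum to within round-off.
print(newton_raphson(quad_gradient, quad_hessian, (10.0, 1.0), iterations=1))
```

The same starting point cost steepest descent many iterations in Example 2.3-2, which is the contrast the paragraph above describes.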

Three negative features of the Newton-Raphson algorithm balance its excellent convergence near the
minimum. First is the behavior of the algorithm far from the minimum. If the initial estimate is far from the
minimum, the algorithm often converges erratically or even diverges. Such problems are often associated with
second gradient matrices that are not positive definite. Because of this problem, it is common to use special
start-up procedures to get within the area where Newton-Raphson performs well. One such procedure is to start
with a gradient method, switching to Newton-Raphson near the minimum. There are many other start-up
procedures, and they play a key role in successful applications of the Newton-Raphson algorithm.

The second negative feature of the Newton-Raphson method is the computational cost and complexity of
evaluating the second gradient matrix. The magnitude of this difficulty varies widely among applications. In some
special cases the second gradient is little harder to compute than the first gradient; Newton-Raphson, perhaps
with a start-up procedure, is a good choice for such applications. If, at the other extreme, you are reduced
to finite-difference computation of the second gradient, Davidon-Fletcher-Powell (Section 2.4.4) is probably
a more appropriate algorithm. In evaluating the computational burden of Newton-Raphson and other methods,
remember that Newton-Raphson requires no one-dimensional searches. Equation (2.4-4) constitutes the entire
algorithm. The one-dimensional searches required by most other algorithms can account for a majority of their
computational cost.

The third negative feature of the Newton-Raphson algorithm is the necessity to invert the second gradient
matrix (or at least to solve the set of linear equations involving the matrix). The computer time required
for the inversion is seldom an issue; this time is typically small compared to the time required to evaluate
the second gradient. Furthermore, the algorithm converges quickly enough that if one linear system solution
per iteration is a large fraction of the total cost, then the total cost must be low, even if the linear system
is on the order of 100-by-100. The crucial issue concerning the inversion of the second gradient is the
possibility that the matrix could be singular or ill-conditioned. We will discuss singularities in Section 2.4.3.

2.4.2  Invariance 


The Newton-Raphson algorithm has far less difficulty with long narrow valleys of the cost function than
does the steepest-descent method. This difference is related to an invariance property of the Newton-Raphson
algorithm. Invariance of minimization algorithms is a useful concept which many texts mention briefly, if at
all. We will therefore elaborate somewhat on the subject.

The examples in the section on steepest descent illustrate a strong link between scaling and narrow
valleys. Scaling changes can easily create such valleys. Therefore we can generally state that minimization
methods that are sensitive to scaling changes are likely to behave poorly in narrow valleys.

This reasoning suggests a simple criterion for evaluating optimization algorithms: a good optimization
algorithm should be invariant under scaling changes. This principle is almost so self-evident as to be
unworthy of mention. The user of a program would be justifiably disgruntled if an algorithm that worked in
the English Gravitational System (Imperial System) of units failed when applied to the same problem expressed
in metric units (or vice versa). Someone trying to duplicate reported results would be perplexed by data
published in metric units which could be duplicated only by converting to English Gravitational System units,
in which the computation was really done. Nonetheless, many common algorithms, including the steepest descent
method, fail to exhibit invariance under scaling.

The criterion is neither necessary nor sufficient. It is easy to construct ridiculous algorithms that are
invariant to scale changes (such as the algorithm that always returns the value zero), and scale-sensitive
algorithms like the steepest descent method have achieved excellent results in some applications. It is safe to
state, however, that you can usually improve a good scale-sensitive algorithm by making it scale-invariant.

An initial step that rescales the problem can effectively make the steepest-descent method scale-invariant
(although such a step destroys a different invariance property of the steepest-descent method: invariance
under rotation of coordinates). Rescaling a problem can be done manually by the user, or it can be an
automatic part of an algorithm; automatic rescaling has the obvious advantage of being easier for the user, and a
secondary advantage of allowing dynamic scaling changes as the algorithm proceeds.

We can extend the idea of invariance beyond scale changes. In general, we would like an algorithm to be
invariant under the largest possible set of transformations. A justification for this criterion is that
almost any complicated minimization problem can be expressed as some transformation (possibly quite
complicated) of a simpler problem. We can sometimes use such transformations to simplify the solution of the
original problems. Often it is more difficult to do the transformation than to solve the original optimization
problem. Even if we cannot do the transformations, we can use the concept to conclude that an optimization
algorithm invariant over a large class of transformations is likely to work on a large class of problems.

The Newton-Raphson algorithm is invariant under all invertible linear transformations. This is the widest
invariance property that we can usually achieve.
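
As a small numerical illustration of these invariance claims, the following Python sketch compares one
steepest-descent step and one Newton-Raphson step on a diagonal quadratic cost, before and after a change of
units in each coordinate. The cost function, the scale factors, and all function names are assumptions chosen
for this example; they do not come from the text.

```python
def gradient(h, x):
    """Gradient of the assumed cost J(x) = 0.5 * sum(h[k] * x[k]**2)."""
    return [hk * xk for hk, xk in zip(h, x)]


def steepest_descent_step(h, x):
    """One steepest-descent step with an exact line search (exact for this quadratic)."""
    g = gradient(h, x)
    alpha = sum(gk * gk for gk in g) / sum(hk * gk * gk for hk, gk in zip(h, g))
    return [xk - alpha * gk for xk, gk in zip(x, g)]


def newton_step(h, x):
    """One Newton-Raphson step; the second gradient of this cost is diag(h)."""
    g = gradient(h, x)
    return [xk - gk / hk for xk, gk, hk in zip(x, g, h)]


def step_in_scaled_units(step, h, x, s):
    """Take the step in rescaled coordinates y[k] = s[k] * x[k], then convert back to x."""
    h_scaled = [hk / (sk * sk) for hk, sk in zip(h, s)]
    y = [sk * xk for sk, xk in zip(s, x)]
    y_new = step(h_scaled, y)
    return [yk / sk for yk, sk in zip(y_new, s)]


h = [1.0, 4.0]    # curvature along each axis (assumed example values)
x0 = [1.0, 1.0]   # starting point
s = [1.0, 2.0]    # change of units applied to each coordinate

sd_original = steepest_descent_step(h, x0)
sd_rescaled = step_in_scaled_units(steepest_descent_step, h, x0, s)
nr_original = newton_step(h, x0)
nr_rescaled = step_in_scaled_units(newton_step, h, x0, s)
```

With these values the two Newton-Raphson results coincide, while the two steepest-descent results land at
different points, which is the scale sensitivity described above.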

The scale-invariance of the Newton-Raphson algorithm can be partially nullified by poor choice of matrix
inversion (or linear system solution) algorithms. We have assumed exact arithmetic in the preceding discussion
of scale-invariance. Some matrix inversion routines are sensitive to scaling effects. Inversion based on
Cholesky factorization (Wilkinson, 1965, and Acton, 1970) is a good, easily implemented method for symmetric
matrices (the second gradient is always symmetric), and is insensitive to scaling. Alternatively, we can
prescale the matrix by using its diagonal elements.
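
One way to read the prescaling suggestion is sketched below in Python for a 2-by-2 symmetric positive
definite matrix: multiply on both sides by D = diag(1/sqrt(h_kk)) so that the scaled matrix has a unit
diagonal, solve the scaled system, and undo the scaling. The example matrix, the right-hand side, and the
function names are assumptions made for illustration only.

```python
import math


def condition_number_sym2(h):
    """Condition number of a symmetric positive definite 2x2 matrix."""
    a, b, c = h[0][0], h[0][1], h[1][1]
    lam_max = 0.5 * (a + c) + math.hypot(0.5 * (a - c), b)
    lam_min = (a * c - b * b) / lam_max   # product of the eigenvalues is the determinant
    return lam_max / lam_min


def prescale(h):
    """Return D*H*D and the diagonal of D, where D = diag(1/sqrt(h[k][k]))."""
    d = [1.0 / math.sqrt(h[k][k]) for k in range(2)]
    scaled = [[d[i] * h[i][j] * d[j] for j in range(2)] for i in range(2)]
    return scaled, d


def solve2(m, r):
    """Solve m*z = r for a 2x2 matrix by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(r[0] * m[1][1] - m[0][1] * r[1]) / det,
            (m[0][0] * r[1] - m[1][0] * r[0]) / det]


def solve_with_prescaling(h, r):
    """Solve h*x = r through the scaled system (D*H*D) z = D*r, with x = D*z."""
    scaled, d = prescale(h)
    z = solve2(scaled, [d[k] * r[k] for k in range(2)])
    return [d[k] * z[k] for k in range(2)]


h = [[1.0e6, 2.0e2], [2.0e2, 1.0]]   # badly scaled, but not intrinsically ill-conditioned
r = [1.0e6, 1.0e2]

cond_before = condition_number_sym2(h)
cond_after = condition_number_sym2(prescale(h)[0])
x = solve_with_prescaling(h, r)
```

For this assumed matrix the condition number drops from roughly a million to about 1.5 after prescaling,
while the solution of the original system is unchanged.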

2.4.3  Singularities 

The second gradient matrix used in the Newton-Raphson algorithm is positive definite in a region near a
strict local minimum. Ideally, the start-up procedure will reach such a region, and the Newton-Raphson
algorithm will then converge without needing to contend with singularities. This viewpoint is overly optimistic;
singular or ill-conditioned matrices (the difference is largely academic) arise in many situations. In the
following discussion, we discount the effects of scaling. Matrices that have large condition numbers because
of scaling do not represent intrinsically ill-conditioned problems, and do not require the techniques
discussed in this section.

In some situations, the second gradient matrix is exactly singular for all values of x; two columns (and
rows) are identical or a column (and corresponding row) is zero. These simple singularities occur regularly
even in complex nonlinear problems. They often result from errors in the problem formulation, such as
minimizing with respect to a parameter that is irrelevant to the cost function.

In the more general case, the second gradient is singular (or ill-conditioned) at some points but not at
others. Whenever we use the term singular in the following discussion, we implicitly mean singular or
ill-conditioned. Because of this definition, there will be vaguely defined regions of singularity rather than
isolated points. The consequences of singularities are different depending on whether or not they are near
the minimum.

Singularities far from the minimum pose no basic theoretical difficulties. There are several practical
methods for handling such singularities. One method is to use a gradient algorithm (or any other algorithm
unaffected by such singularities) until x is out of the region of singularity. We can also use this method
if the second gradient matrix has negative eigenvalues, whether the matrix is ill-conditioned or not. If the
matrix has negative eigenvalues, the Newton-Raphson algorithm is likely to behave poorly. (It could even
converge to a local maximum.) The second gradient is always positive semi-definite in a region around a local
minimum, so negative eigenvalues are only a consideration away from the minimum.

Another method of handling singularities is to add a small positive definite matrix to the second gradient
before inversion. We can also use this method to handle negative eigenvalues if the added matrix is large
enough. This method is closely related to the previous suggestion of using a gradient algorithm. If the added
matrix is a large constant times an identity matrix, the Newton-Raphson algorithm, so modified, gives a small
step in the negative gradient direction. For small constants, the algorithm has characteristics between those
of steepest descent and Newton-Raphson. The computational cost of this method is high; in essence, we are
getting performance like steepest descent while paying the computational cost of Newton-Raphson. Even small
additions to the second derivative matrix can dramatically change the convergence behavior of the
Newton-Raphson algorithm. We should therefore discontinue this modification when out of the region of singularity.
The advantage of this method is its simplicity; excluding the test of when the matrix is ill-conditioned, this
modification can be done in two short lines of FORTRAN code.
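
The modification just described can be sketched in a few lines. The Python fragment below (a stand-in for
the FORTRAN mentioned above, not a transcription of it) adds a constant c times the identity to a 2-by-2
second gradient before solving for the step. The matrix, the gradient, the value of c, and the function
names are assumed example values.

```python
def solve2(m, r):
    """Solve m*z = r for a 2x2 matrix by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(r[0] * m[1][1] - m[0][1] * r[1]) / det,
            (m[0][0] * r[1] - m[1][0] * r[0]) / det]


def modified_newton_step(h, g, c):
    """Step obtained from (H + c*I) dx = -g; c = 0 gives the unmodified Newton-Raphson step."""
    m = [[h[0][0] + c, h[0][1]],
         [h[1][0], h[1][1] + c]]   # the modification: add c to each diagonal element
    return solve2(m, [-g[0], -g[1]])


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


h = [[2.0, 0.0], [0.0, -1.0]]   # second gradient with one negative eigenvalue
g = [1.0, 2.0]                  # first gradient at the current point

plain = modified_newton_step(h, g, 0.0)
damped = modified_newton_step(h, g, 1000.0)
```

Here the unmodified step has a positive inner product with the gradient (it points uphill), whereas the
step with the large added constant is close to -g/c, a small step in the negative gradient direction.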

The last method is to use a pseudo-inverse (rank-deficient solution). Penrose (1955), Aoki (1967),
Luenberger (1969), Wilkinson and Reinsch (1971), Moler and Stewart (1973), and Garbow, Boyle, Dongarra, and
Moler (1977) discuss pseudo-inverses in detail. The basic idea of the pseudo-inverse method is to ignore the
directions in the x-space corresponding to zero eigenvalues (within some tolerance) of the second gradient.
In the parameter estimation context, such directions represent parameters, or combinations of parameters, about
which the data give little information. Lacking any information to the contrary, the method leaves such
parameter combinations unchanged from their initial values.
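
A minimal sketch of this idea for a 2-by-2 symmetric second gradient follows. It forms the eigenvalues and
eigenvectors in closed form and builds the step only from directions whose eigenvalues exceed a relative
tolerance. The tolerance, the example matrices, and the function names are assumptions for illustration; a
production routine would use a library eigenvalue or singular-value decomposition.

```python
import math


def eig_sym2(h):
    """Eigenvalues and orthonormal eigenvectors of a symmetric 2x2 matrix (one Jacobi rotation)."""
    a, b, c = h[0][0], h[0][1], h[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    lam1 = a * cs * cs + 2.0 * b * cs * sn + c * sn * sn
    lam2 = a * sn * sn - 2.0 * b * cs * sn + c * cs * cs
    return [(lam1, [cs, sn]), (lam2, [-sn, cs])]


def pseudo_inverse_step(h, g, rel_tol=1e-10):
    """Step -pinv(H)*g that ignores directions with negligible eigenvalues."""
    pairs = eig_sym2(h)
    cutoff = rel_tol * max(abs(lam) for lam, _ in pairs)
    step = [0.0, 0.0]
    for lam, v in pairs:
        if abs(lam) > cutoff:
            coeff = -(v[0] * g[0] + v[1] * g[1]) / lam
            step = [step[0] + coeff * v[0], step[1] + coeff * v[1]]
    return step


# Singular case: two identical columns, so the combination x1 - x2 is undetermined.
singular_step = pseudo_inverse_step([[1.0, 1.0], [1.0, 1.0]], [2.0, 2.0])

# Well-conditioned case: the result equals the ordinary Newton-Raphson step -inv(H)*g.
regular_step = pseudo_inverse_step([[2.0, 0.0], [0.0, 4.0]], [2.0, 8.0])
```

In the singular case the step changes both parameters equally and leaves their difference at its starting
value; in the well-conditioned case the step is the same as the unmodified Newton-Raphson step.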

The pseudo-inverse method does not address the problem of negative eigenvalues, but it is popular in a
large class of applications where negative eigenvalues are impossible. The method is easy to implement, being
only a rewrite of the matrix-inversion or linear-system-solution subroutine. It also has a useful property
absent from the other proposed methods; it does not affect the Newton-Raphson algorithm when the matrix is
well-conditioned. Therefore one can freely apply this method without testing whether it is needed. (It is
true that condition tests in some form are part of a pseudo-inverse algorithm, but such tests are at a lower
level contained within the pseudo-inverse subroutine.)

Singularities near the minimum require special consideration. The excellent convergence of
Newton-Raphson near the minimum is the primary reason for using the algorithm. If we significantly slow the
convergence near the minimum, there is little argument for using Newton-Raphson. The use of a pseudo-inverse can
handle singularities while maintaining the excellent convergence; the pseudo-inverse is thus an appropriate
tool for this purpose.

Although pseudo-inverses handle the computational problems, singularities near the minimum also raise
theoretical and application issues. Such a singularity indicates that the minimum point is poorly defined.
The cost function is essentially flat in at least one direction from the minimum, and the minimum value of the
cost function might be attained to machine accuracy by widely separated points. Although the algorithm
converges to a minimum point, it might be the wrong minimum point if the minimum is flat. If the only goal is to
minimize the cost function, any minimizing point might be acceptable. In the applications of this book,
minimizing the cost function is only a means to an end; the desired output is the value of x. If multiple
solutions exist, the problem statement is incomplete or faulty.

We strongly advise avoiding the routine use of pseudo-inverses or other computational machinations to
"solve" uniqueness problems. If the basic problem statement is faulty, no numerical trick will solve it. The
pseudo-inverse works by changing the problem statement of the inversion, adding the stipulation that the
inverse have minimum norm. The interpretation of this stipulation is vague in the context of the optimization
problem (unless the cost function is quadratic, in which case it specifies the solution nearest the starting
point). If this stipulation is a reasonable addition to the problem statement, then the pseudo-inverse is an
appropriate tool. This decision can have significant effects. For a nonquadratic cost function, for example,
there might be large differences in the solution point, depending on small changes in the starting point, the
data, or the algorithm.

The pseudo-inverse can be a good diagnostic tool for getting the information needed to revise the problem
statement, but one should not depend upon it to solve the problem autonomously. The analyst's strong point is
in formulating the problem; the computer's strength is in crunching numbers to arrive at the solution. A
failure in either role will compromise the validity of the solution. This statement is but a rephrasing of
the computer cliche "garbage in, garbage out," which has been said many more times than it has been heard.


2.4.4 Quasi-Newton Methods


Quasi-Newton methods are intended for problems where explicit evaluation of the second gradient of the
cost function is complicated or costly, but the performance of the Newton-Raphson algorithm is desired. These
methods form approximations to the second-gradient matrix using the first-gradient values from several
iterations. The approximation to the second gradient then substitutes for the exact second gradient in
Equation (2.4-4). Some of the methods directly form approximations of the inverse of the second-gradient matrix,
avoiding the cost and some of the problems of matrix inversion.

Note that as long as the approximation to the second-gradient matrix is positive definite,
Equation (2.4-4) can never converge to any point with a nonzero first gradient. Therefore approximations to the
second gradient, no matter how poor, cannot affect the solution point. The approximations can greatly change
the speed of convergence and the area of acceptable starting values. Approximations to the first gradient
would affect the solution point as well.
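
The following Python sketch illustrates this point on a one-dimensional problem. It iterates
x <- x - g(x)/B(x), once with the exact second gradient as B and once with a deliberately crude positive
constant. The cost function J(x) = (x - 3)^2 + 0.1 (x - 3)^4, the constant 4, the starting point, and the
function names are all assumptions made for the example.

```python
def gradient(x):
    """First gradient of the assumed cost J(x) = (x - 3)**2 + 0.1 * (x - 3)**4."""
    e = x - 3.0
    return 2.0 * e + 0.4 * e ** 3


def exact_second_gradient(x):
    """Exact second gradient of the assumed cost."""
    e = x - 3.0
    return 2.0 + 1.2 * e ** 2


def crude_second_gradient(x):
    """A deliberately poor, but positive, constant approximation."""
    return 4.0


def iterate(second_gradient, x, tol=1e-12, max_iter=200):
    """Repeat x <- x - g(x)/B(x) until |g(x)| < tol; return the point and the iteration count."""
    for count in range(max_iter):
        g = gradient(x)
        if abs(g) < tol:
            return x, count
        x -= g / second_gradient(x)
    return x, max_iter


x_exact, n_exact = iterate(exact_second_gradient, 4.0)
x_crude, n_crude = iterate(crude_second_gradient, 4.0)
```

Both runs stop at the same point, x = 3, where the first gradient vanishes; only the number of iterations
differs, with the crude approximation needing many more.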

The  steepest  descent  method  can  be  considered  as  the  crudest  of  the  quasi-Newton  methods,  using  a  constant 
times  the  identity  matrix  as  the  approximation  to  the  second  gradient.  The  performance  of  the  quasi-Newton 
methods  approaches  that  of  Newton-Raphson  as  the  approximation  to  the  second  gradient  improves.  The 
Davidon-Fletcher-Powell  method  (variable  metric  method)  is  the  most  popular  quasi-Newton  method.  See  the 
references  for  discussions  of  these  methods. 


2.5  SUMS  OF  SQUARES 


The algorithms discussed in the previous sections are generally applicable to any minimization problem.
By tailoring algorithms to special characteristics of specific problem classes, we can often achieve far
better performance than by using the general purpose algorithms.

Many  of  the  cost  functions  arising  in  estimation  problems  have  the  form  of  sums  of  squares.  The  general 
sums-of-squares  form  is 

$$J(x) = \sum_{i=1}^{N} [f_i(x)]^* W_i [f_i(x)] \tag{2.5-1}$$

The $f_i$ are vector-valued functions of x, and the $W_i$ are weightings. To simplify some of the formulae,
we assume that the $W_i$ are symmetric. This assumption does not really restrict the application because we can
always substitute $\frac{1}{2}(W_i + W_i^*)$ for a nonsymmetric $W_i$ without changing the function values. In most
applications, the $W_i$ are positive semi-definite; this is not a requirement, but we will see that it helps ensure
that the stationary points encountered are local minima. The form of Equation (2.5-1) is common enough to
merit special study.

The summation sign in Equation (2.5-1) is somewhat superfluous in that any function in the form of
Equation (2.5-1) can be rewritten in an equivalent form without the summation sign. This can be done by
concatenating the N different $f_i(x)$ vectors into a single, longer f(x) vector and making a corresponding large
W matrix with the $W_i$ matrices on diagonal blocks. The only difference is in the notation. We choose the
longer notation with the summation sign because it more directly corresponds with the way many parameter
estimation problems are naturally phrased.


Several of the algorithms discussed in the previous two sections work well with the form of
Equation (2.5-1). For any reasonable $f_i$ functions, Equation (2.5-1) defines a cost function that is
well-approximated by quadratics over fairly large regions. Since many of the general minimization schemes are
based on quadratic approximations, application of these schemes to Equation (2.5-1) is natural. This statement
does not imply that there are never problems minimizing Equation (2.5-1); the problems are sometimes severe,
but the odds of success with reasonable effort are much better than they are for arbitrary cost function forms.
Although the general methods are usable, we can exploit the problem structure to do better.

2.5.1  Linear  Case 

If the $f_i$ functions in Equation (2.5-1) are linear, then the cost function is exactly quadratic and we
can express the minimum point in closed form. In particular, let the $f_i$ be the arbitrary linear functions

$$f_i(x) = A_i x + b_i \tag{2.5-2}$$

Equation (2.5-1) then becomes

$$J(x) = \sum_{i=1}^{N} [A_i x + b_i]^* W_i [A_i x + b_i] \tag{2.5-3}$$

Equating the gradient of Equation (2.5-3) to zero gives

$$2 \sum_{i=1}^{N} [A_i x + b_i]^* W_i A_i = 0 \tag{2.5-4}$$


Solving for x gives

$$x = -\left[\sum_{i=1}^{N} A_i^* W_i A_i\right]^{-1} \sum_{i=1}^{N} A_i^* W_i b_i \tag{2.5-5}$$

assuming that the inverse exists.


If the inverse exists, then Equation (2.5-5) gives the only stationary point of Equation (2.5-3). This
stationary point must be a minimum if all the $W_i$ are positive semi-definite, and it must be a maximum if all
the $W_i$ are negative semi-definite. (We leave the straightforward proofs as an exercise.) If the $W_i$ meet
neither of these conditions, the stationary point can be a minimum, a maximum, or a saddle point.

If the inverse in Equation (2.5-5) does not exist, then there is a line (at least) of solutions to
Equation (2.5-4). All of these points are stationary points of the cost function. Use of a pseudo-inverse will
produce the solution with minimum norm, but this is usually a poor idea (see Section 2.4.3).
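
As a worked illustration of the linear case, the Python sketch below solves Equation (2.5-4) for x, that is,
it evaluates x = -[sum of A_i* W_i A_i]^(-1) [sum of A_i* W_i b_i], for scalar-valued f_i and two unknowns.
The straight-line fitting problem, the data values, and the function name are assumptions chosen for the
example.

```python
def solve_linear_sum_of_squares(rows, offsets, weights):
    """Minimize sum_i w_i * (a_i . x + b_i)**2 for two unknowns and scalar f_i."""
    # Accumulate M = sum_i a_i^T w_i a_i and v = sum_i a_i^T w_i b_i.
    m = [[0.0, 0.0], [0.0, 0.0]]
    v = [0.0, 0.0]
    for a, b, w in zip(rows, offsets, weights):
        for p in range(2):
            v[p] += a[p] * w * b
            for q in range(2):
                m[p][q] += a[p] * w * a[q]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0.0:
        raise ValueError("the matrix sum is singular; the stationary point is not unique")
    # x = -M^{-1} v, written out with the explicit 2x2 inverse.
    return [-(m[1][1] * v[0] - m[0][1] * v[1]) / det,
            -(m[0][0] * v[1] - m[1][0] * v[0]) / det]


# Fit y = slope * t + intercept, so f_i(x) = [t_i, 1] x - y_i and b_i = -y_i.
times = [0.0, 1.0, 2.0, 3.0]
measurements = [1.0, 3.0, 5.0, 7.0]   # assumed data lying exactly on y = 2 t + 1
rows = [[t, 1.0] for t in times]
offsets = [-y for y in measurements]
weights = [1.0, 1.0, 1.0, 1.0]

slope, intercept = solve_linear_sum_of_squares(rows, offsets, weights)
```

Because the assumed data lie exactly on a line, the closed-form solution recovers the slope 2 and the
intercept 1.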

2.5.2  Nonlinear  Case 

If the $f_i$ are nonlinear, there is no simple, closed-form solution like Equation (2.5-5). A natural
question in such situations, in which there is an easy method to handle linear equations, is whether we can
merely linearize the nonlinear equations and use the linear methodology. Such linearization does not give an
acceptable closed-form solution to the current problem, but it does form the basis for an iterative method.

Define the linearization of $f_i$ about any point $x_j$ as

$$f_i^{(j)}(x) = A_i^{(j)} x + b_i^{(j)} \tag{2.5-6}$$

where

$$A_i^{(j)} = \nabla_x f_i(x_j) \tag{2.5-7a}$$

$$b_i^{(j)} = f_i(x_j) - A_i^{(j)} x_j \tag{2.5-7b}$$

Equation (2.5-5), with the $A_i^{(j)}$ and $b_i^{(j)}$ substituted for $A_i$ and $b_i$, gives the stationary point of the cost
with the linearized $f_i$ functions. This point is not, in general, a solution to the nonlinear problem. If,
however, $x_j$ is close to the solution, then Equation (2.5-5) should give a point closer to the solution,
because the linearization will give a good representation of the cost function in the region around $x_j$.

The iterative algorithm resulting from this concept is as follows: First, choose a starting value $x_0$.
The closer $x_0$ is to the correct solution, the better the algorithm is likely to work. Then define revised
$x_j$ values by

$$x_{j+1} = x_j - \left[\sum_{i=1}^{N} [\nabla_x f_i(x_j)]^* W_i [\nabla_x f_i(x_j)]\right]^{-1} \sum_{i=1}^{N} [\nabla_x f_i(x_j)]^* W_i f_i(x_j) \tag{2.5-8}$$

This equation comes from substituting Equation (2.5-7) into Equation (2.5-5) and simplifying. Iterate
Equation (2.5-8) until it converges by some criterion, or until you give up. This method is often called
quasi-linearization because it is based on linearization not of the cost function itself, but of factors in the
cost function.
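
The iteration can be sketched in Python as follows, using the update x_{j+1} = x_j - [sum of (grad f_i)*
W_i (grad f_i)]^(-1) [sum of (grad f_i)* W_i f_i] with unit weights. The two-parameter exponential model,
the noise-free data, the starting point, the fixed iteration count, and the function names are assumptions
for the example; no step control or convergence test is included.

```python
import math


def residuals_and_gradients(x, times, data):
    """f_i(x) = a*exp(-k*t_i) - y_i and its gradient row with respect to (a, k)."""
    a, k = x
    f, rows = [], []
    for t, y in zip(times, data):
        e = math.exp(-k * t)
        f.append(a * e - y)
        rows.append([e, -a * t * e])
    return f, rows


def quasi_linearization(x, times, data, iterations=10):
    """Repeat the linearize-and-solve step a fixed number of times (unit weights)."""
    for _ in range(iterations):
        f, rows = residuals_and_gradients(x, times, data)
        m = [[0.0, 0.0], [0.0, 0.0]]   # sum of grad^T grad
        v = [0.0, 0.0]                 # sum of grad^T f
        for fi, row in zip(f, rows):
            for p in range(2):
                v[p] += row[p] * fi
                for q in range(2):
                    m[p][q] += row[p] * row[q]
        det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
        dx = [-(m[1][1] * v[0] - m[0][1] * v[1]) / det,
              -(m[0][0] * v[1] - m[1][0] * v[0]) / det]
        x = [x[0] + dx[0], x[1] + dx[1]]
    return x


times = [0.0, 1.0, 2.0, 3.0]
data = [2.0 * math.exp(-0.5 * t) for t in times]   # generated with a = 2, k = 0.5
a_hat, k_hat = quasi_linearization([1.8, 0.45], times, data)
```

Starting reasonably close to the values used to generate the data, the iteration returns to a = 2 and
k = 0.5; a start far from the solution may behave much worse.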

We made several vague, unsupported statements in the process of deriving this algorithm. We now need to
analyze the algorithm's performance and compare it with the performance of the algorithms discussed in the
previous sections. This task is greatly simplified by noting that Equation (2.5-8) defines a quasi-Newton
algorithm. To show this, we can write the first and second gradients of Equation (2.5-1):


$$\nabla_x J(x) = 2 \sum_{i=1}^{N} [f_i(x)]^* W_i [\nabla_x f_i(x)] \tag{2.5-9}$$

$$\nabla_x^2 J(x) = 2 \sum_{i=1}^{N} [\nabla_x f_i(x)]^* W_i [\nabla_x f_i(x)] + 2 \sum_{i=1}^{N} [f_i(x)]^* W_i \nabla_x^2 f_i(x) \tag{2.5-10}$$


(We have not previously introduced the definition of the second gradient of a vector, as in the $\nabla_x^2 f_i(x)$
above. The result is technically a tensor, but we will not need to consider it in detail here.) Comparing
Equation (2.5-8) with Equations (2.4-4), (2.5-9), and (2.5-10), we see that the only difference between
quasi-linearization and Newton-Raphson is that quasi-linearization has dropped the second term in Equation (2.5-10).
Quasi-linearization is thus a quasi-Newton method using


$$\nabla_x^2 J(x) \approx 2 \sum_{i=1}^{N} [\nabla_x f_i(x)]^* W_i [\nabla_x f_i(x)] \tag{2.5-11}$$


as an approximation for the second gradient. The algorithm in this form is also known as Gauss-Newton, the
term we will adopt in this book.

Near the solution, the neglected term of the second gradient is generally small. Section 5.4.3 outlines
this argument as it applies to the parameter estimation problem. Therefore, Gauss-Newton approaches the
excellent performance of Newton-Raphson near the solution. Such approximation is the main goal of quasi-Newton
methods.

Accurately approximating the performance of Newton-Raphson far from the minimum is not of great concern
because Newton-Raphson does not generally perform well in regions far from the minimum. We can even argue that
Gauss-Newton sometimes performs better than Newton-Raphson far from the minimum. The worst problems with
Newton-Raphson occur when the second gradient matrix has negative eigenvalues; Newton-Raphson can then go in
the wrong direction, possibly converging to a local maximum or diverging. If all of the $W_i$ are positive
semi-definite (which is usually the case), then the second gradient approximation given by Equation (2.5-11)
is positive semi-definite for all x. A positive semi-definite second gradient approximation does not
guarantee good behavior, but it surely helps; negative eigenvalues virtually guarantee problems. Thus we can
heuristically argue that Gauss-Newton should perform better than Newton-Raphson. We will not attempt a detailed
support of this general argument in this book. In several specific cases the improvement of Gauss-Newton over
Newton-Raphson is easily demonstrable.
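
One such case can be constructed by hand. The Python sketch below uses the scalar cost J(x) = [f(x)]^2 with
f(x) = x^2 - 1 and unit weight, which has minima at x = 1 and x = -1 and a local maximum at x = 0. This cost
function, the evaluation point x = 0.1, and the function names are assumptions made for illustration.

```python
def f(x):
    return x * x - 1.0


def df(x):
    return 2.0 * x


def d2f(x):
    return 2.0


def first_gradient(x):
    """Gradient of J = f**2 (unit weight): 2 f f'."""
    return 2.0 * f(x) * df(x)


def newton_second_gradient(x):
    """Full second gradient of J: 2 f'**2 + 2 f f''."""
    return 2.0 * df(x) ** 2 + 2.0 * f(x) * d2f(x)


def gauss_newton_second_gradient(x):
    """Gauss-Newton approximation: the term 2 f f'' is dropped."""
    return 2.0 * df(x) ** 2


x = 0.1
g = first_gradient(x)
newton_step = -g / newton_second_gradient(x)
gauss_newton_step = -g / gauss_newton_second_gradient(x)
```

At x = 0.1 the full second gradient is negative, so the Newton-Raphson step points uphill and moves toward
the local maximum at x = 0. The Gauss-Newton step points downhill, toward the minimum at x = 1, although it
overshoots badly, which is consistent with the remark that a positive semi-definite approximation helps
without guaranteeing good behavior.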

Although Gauss-Newton sometimes performs better than Newton-Raphson far from the solution, it has many of
the same basic start-up problems. Both algorithms exhibit their best performance near the minimum. Therefore,
we will often need to begin with some other, more stable algorithm, changing to Gauss-Newton as we near the
minimum.

The real argument in favor of Gauss-Newton over Newton-Raphson is the lower computational effort and
complexity of Gauss-Newton. Any performance improvement is a coincidental side benefit. Equation (2.5-11)
involves only first derivatives of $f_i(x)$. These first derivatives are also used in Equation (2.5-9) for the
first gradient of the cost. Therefore, after computing the first gradient of J, the only significant
computation remaining for the Gauss-Newton approximation is the matrix multiplication in Equation (2.5-11). The
computation of the Gauss-Newton approximation for the second gradient can sometimes take less time than the
computation of the first gradient, depending on the system dimensions. For complicated $f_i$ functions, evaluation
of the $\nabla_x^2 f_i(x)$ in Equation (2.5-10) is a major portion of the computation effort of the full Newton-Raphson
algorithm. Gauss-Newton avoids this extra effort, obtaining the performance per iteration of Newton-Raphson
(if not better in some areas) with computational effort per iteration comparable to gradient methods.

Considering the cost of the one-dimensional searches required by gradient methods, Gauss-Newton can even
be cheaper per iteration than gradient methods. The exact trade-off depends on the relative costs of
evaluating the $f_i$ and their gradients, and on the typical number of evaluations required in the one-dimensional
searches. Gauss-Newton is at its best when the cost of evaluating the $f_i$ is nearly as much as the cost of
evaluating both the $f_i$ and their gradients due to high overhead costs common to both evaluations. This is
exactly the case in some aircraft applications, where the overhead consists largely of dimensionalizing the
derivatives and building new system matrices at each time point.

The other quasi-Newton methods, such as Davidon-Fletcher-Powell, also approach Newton-Raphson performance
without evaluating the second derivatives of the $f_i$. These methods, however, do require one-dimensional
searches. Gauss-Newton stands almost alone in avoiding both second derivative evaluations and one-dimensional
searches. This performance is difficult to match in general algorithms that do not take advantage of the
special structure of the cost function.

Some analysts (Foster, 1983) introduce one-dimensional line searches into the Gauss-Newton algorithm to
improve its performance. The utility of this idea depends on how well the Gauss-Newton method is performing.
In most of our experience, Gauss-Newton works well enough that the one-dimensional line searches cannot
measurably improve performance; the total computation time can well be larger with the line searches. When the
Gauss-Newton algorithm is performing poorly, however, such line searches could help stabilize it.

For cost functions in the form of Equation (2.5-1), the cost/performance ratio of Gauss-Newton is so much
better than that of most other algorithms that Gauss-Newton is the clearly preferred algorithm. You may want
to modify Gauss-Newton for specific problems, and you will almost surely need to use some special start-up
algorithm, but the best methods will be based on Gauss-Newton.


2.6  CONVERGENCE  IMPROVEMENT 

Second-order methods tend to converge quite rapidly in regions where they work well. There is usually
such a region around the minimum point; the size of the region is problem-dependent. The price paid for this
region of excellent convergence is that the second-order methods often converge poorly or diverge in regions
far from the minimum. Techniques to detect and remedy such convergence problems are an important part of the
practical implementation of second-order methods. In this section, we briefly list a few of the many
convergence improvement techniques.

Modifications to improve the behavior of second-order methods in regions far from the minimum almost
inevitably slow the convergence in the region near the minimum. This reflects a natural trade-off between
speed and reliability of convergence. Therefore, effective implementation of convergence-improvement
techniques usually includes different treatment of regions far from the minimum and near the minimum.

In regions far from the minimum, the second-order methods are modified or abandoned in favor of more
conservative algorithms. In regions near the minimum, there is a transition to the fast second-order methods.

The means of determining when to make such transitions vary widely. Transitions can be based on a simple
iteration count, on adaptive criteria which examine the observed convergence behavior, or on other principles.
Transitions can be either gradual or step changes.

Some convergence improvement techniques abandon second-order methods in the regions far from the minimum,
adopting gradient methods instead. In our experience, the pure gradient method is too slow for practical use
on most parameter estimation problems. Accelerated gradient methods such as PARTAN and conjugate gradient are
reasonable possibilities.

Other convergence improvement techniques are modifications of the second-order methods. Many convergence
problems relate to ill-conditioned or nonpositive second gradient matrices. This suggests such modifications
as adding positive definite matrices to the second gradient or using rank-deficient solutions.

Constraints on the allowable range of estimates or on the change per iteration can also have stabilizing
effects. A particularly popular constraint is to fix some of the ordinates at constant values, thus reducing
the dimension of the optimization problem; this is a form of axial iteration, and its effectiveness depends on
a wise (or lucky) choice of the ordinates to be constrained.

Relaxation methods, which reduce the indicated parameter changes by some fixed percentage, can sometimes
stabilize oscillating behavior of the algorithm. Line searches in the indicated direction extend this concept
and should be capable of stabilizing almost any problem, at the cost of additional function evaluations.
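
A very simple member of this family is sketched below in Python: the indicated change is halved repeatedly
until the cost decreases. This is a crude stand-in for a proper one-dimensional search, not a method taken
from the text. The cost function, the starting point, the overshooting step of 4.95, the halving rule, and
the function names are assumptions for the example.

```python
def cost(x):
    """Assumed example cost J(x) = (x**2 - 1)**2."""
    return (x * x - 1.0) ** 2


def relaxed_update(cost_function, x, step, max_halvings=20):
    """Halve the indicated step until the cost decreases; return the new x and the factor used."""
    base = cost_function(x)
    factor = 1.0
    for _ in range(max_halvings):
        trial = x + factor * step
        if cost_function(trial) < base:   # one extra function evaluation per trial
            return trial, factor
        factor *= 0.5
    return x, 0.0   # no acceptable step found; leave x unchanged


x = 0.1
step = 4.95   # an indicated change that overshoots the minimum at x = 1
x_new, factor = relaxed_update(cost, x, step)
```

For these values the full step and the half step both increase the cost, and the quarter step is accepted,
giving a lower cost at the price of three cost evaluations beyond the starting one.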

The above list of convergence improvement techniques is far from complete. It also omits mention of
numerous important implementation details. This list serves only to call attention to the area of convergence
improvement. See the references for more thorough treatments.


Figure (2.0-1). Illustration of local and global minima.

Figure (2.2-2). The pattern direction.

Figure (2.3-1). The gradient direction from a circular isocline.

Figure (2.3-2). The gradient direction near a narrow valley.

Figure (2.3-3). Behavior of the gradient algorithm in a narrow valley.

Figure (2.3-4). Worse behavior of the gradient algorithm.


CHAPTER 3


3.0  BASIC  PRINCIPLES  FROM  PROBABILITY 

In this chapter we will review some basic definitions and results from probability theory. We presume
that the reader has had previous exposure to this material. Our aim here is to review and serve as a
reference for those concepts that are used extensively in the following chapters. The treatment, therefore, is
quite abbreviated, and devotes little time to motivating the field of study or philosophizing about the
results. Proofs of several of the statements are omitted. Some of the other proofs are merely outlined, with
some of the more tedious steps omitted. Apostol (1969), Ash (1970), and Papoulis (1965) give more detailed
treatment.


3.1  PROBABILITY  SPACES 

3.1.1 Probability Triple

A probability space is formally defined by three items $(\Omega, \beta, P)$, sometimes called the probability triple.
$\Omega$ is called the sample space, and the elements $\omega$ of $\Omega$ are called outcomes or realizations. $\beta$ is a set of
sets defined on $\Omega$, closed under countable set operations (union, intersection, and complement). Each set
$B \in \beta$ is called an event. In the current discussion, we will not be concerned with the fine details of the
definition of $\beta$. $\beta$ is referred to as the class of measurable sets and is studied in measure theory (Royden,
1968; Rudin, 1974). $P$ is a scalar valued function defined on $\beta$, and is called the probability function or
probability measure. For each set $B$ in $\beta$, the function $P(B)$ defines the probability that $\omega$ will be in $B$.
$P$ must satisfy the following axioms:

1) $0 \le P(B) \le 1$ for all $B \in \beta$

2) $P(\Omega) = 1$

3) $P\left(\bigcup_i B_i\right) = \sum_i P(B_i)$ for all countable sequences of disjoint $B_i \in \beta$

3.1.2  Conditional  Probabilities 

If A and B are two events and $P(B) \ne 0$, the conditional probability of A given B is defined as

$$P(A|B) = P(A \cap B)/P(B) \tag{3.1-1}$$

where $A \cap B$ is the set intersection of the events A and B.

The events A and B are statistically independent if $P(A|B) = P(A)$. Note that this condition is
symmetric; that is, if $P(A|B) = P(A)$, then $P(B|A) = P(B)$, provided that $P(A|B)$ and $P(B|A)$ are both defined.
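
These definitions can be checked by direct enumeration on a small sample space. The Python sketch below uses
two fair dice with equally likely outcomes; the choice of experiment, the particular events, and the
function names are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # sample space for two fair dice


def prob(event):
    """Probability of an event (a set of outcomes) when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))


def conditional(a, b):
    """P(A|B) = P(A intersect B) / P(B), defined only when P(B) is nonzero."""
    if prob(b) == 0:
        raise ValueError("P(B) must be nonzero")
    return prob(a & b) / prob(b)


first_even = {w for w in outcomes if w[0] % 2 == 0}
sum_is_seven = {w for w in outcomes if sum(w) == 7}
sum_is_eight = {w for w in outcomes if sum(w) == 8}

p_even_given_seven = conditional(first_even, sum_is_seven)
p_seven_given_even = conditional(sum_is_seven, first_even)
p_even_given_eight = conditional(first_even, sum_is_eight)
```

The events "first die is even" and "sum is seven" are independent, and the condition holds in both
directions; "first die is even" and "sum is eight" are not independent, since the conditional probability
3/5 differs from 1/2.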


3.2 SCALAR RANDOM VARIABLES

A scalar real-valued function $X(\omega)$ defined on $\Omega$ is called a random variable if the set $\{\omega: X(\omega) \le x\}$ is
in $\beta$ for all real x.

3.2.1 Distribution and Density Functions

Each random variable has a distribution function defined as follows:

$$F_X(x) = P(\{\omega: X(\omega) \le x\}) \tag{3.2-1}$$

It  folio'  directly  from  the  properties  of  a  probability  measure  that  F*(x)  must  be  a  nondecreasing  function 
of  x,  ►  .h  Fyv— )  ■  0  and  Fx(«*)  ■  1.  By  the  Lebesque  decomposition  learn  (Royden,  1968,  p.  240;  Rudln, 
1974,  p.  129),  any  distribution  function  can  always  be  written  as  the  sum  of  a  differentiable  component  and  a 
compo.  which  Is  piecewise  constant  with  a  countable  number  of  discontinuities.  In  many  cases,  we  will  be 
concerned  with  variables  with  differentiable  distribution  functions.  For  such  random  variables,  we  define  a 
function,  px(s),  called  the  probability  density  function,  to  be  the  derivative  of  the  distribution  function: 

px(x)  -  £  Fx(x)  (3.2-2) 

Uc  have  also  the  Inverse  relationship 

Fx(x)  -  f*  px(s)ds  (3.2-3) 


A probability density function must be nonnegative, and its integral over the real line must equal 1. For
simplicity of notation, we will often shorten $p_X(x)$ to $p(x)$ where the meaning is clear. Where confusion is
possible, we will retain the longer notation.

A  probability  distribution  can  be  defined  completely  by  giving  either  the  distribution  function  or  the 
density  function.  We  will  work  mainly  with  density  functions,  except  when  they  are  not  defined. 
3.2.2 Expectations and Moments

The expected value of a random variable, $X$, is defined by

$E\{X\} = \int_{-\infty}^{\infty} x\,p_X(x)\,dx$    (3.2-4)

If $X$ does not have a density function, the precise definition of the expectation is somewhat more technical,
involving a Stieltjes integral; Equation (3.2-4) is adequate for the needs of this document. The expected
value is also called the expectation or the mean. Any (measurable) function of a random variable is also a
random variable and

$E\{f(X)\} = \int_{-\infty}^{\infty} f(x)\,p_X(x)\,dx$    (3.2-5)

The expected value of $X^n$ for positive $n$ is called the nth moment of $X$. Under mild conditions, knowledge
of all of the moments of a distribution is sufficient to define the distribution (Papoulis, 1965, p. 158).

The variance of $X$ is defined as

$\operatorname{var}(X) \equiv E\{(X - E\{X\})^2\}$

$= E\{X^2\} + E\{X\}^2 - 2E\{X\}E\{X\}$

$= E\{X^2\} - E\{X\}^2$    (3.2-6)

The standard deviation is the square root of the variance.
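Equations (3.2-4) through (3.2-6) can be evaluated numerically for any density that is easy to integrate. The sketch below assumes a variable that is uniform on (-1, 1) and uses a simple midpoint rule; the helper names `integrate`, `uniform_density`, and `expectation` are illustrative choices, not part of the text.

```python
def integrate(f, a, b, n=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def uniform_density(x):
    """Assumed example density: uniform on (-1, 1)."""
    return 0.5 if -1.0 < x < 1.0 else 0.0

def expectation(f, density, a, b):
    """E{f(X)} as in Equation (3.2-5), for a density supported on [a, b]."""
    return integrate(lambda x: f(x) * density(x), a, b)

mean = expectation(lambda x: x, uniform_density, -1.0, 1.0)
second_moment = expectation(lambda x: x * x, uniform_density, -1.0, 1.0)

# Equation (3.2-6): var(X) = E{X^2} - E{X}^2
variance = second_moment - mean ** 2
```

For this density the mean is 0 and the variance is 1/3, the same values quoted later for the uniform distribution in Example 3.5-3.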

3.3  JOINT  RANDOM  VARIABLES 

Two  random  variables  defined  on  the  same  sample  space  are  called  joint  random  variables. 

3.3.1 Distribution and Density Functions

If two random variables, $X$ and $Y$, are defined on the same sample space, we define a joint distribution
function of these variables as

$F_{X,Y}(x,y) = P(\{\omega: X(\omega) \le x,\ Y(\omega) \le y\})$    (3.3-1)

For absolutely continuous distribution functions, a joint probability density function $p_{X,Y}(x,y)$ is defined
by the partial derivative

$p_{X,Y}(x,y) = \frac{\partial^2}{\partial x\,\partial y} F_{X,Y}(x,y)$    (3.3-2)

We then have also

$F_{X,Y}(x,y) = \int_{-\infty}^{x} \int_{-\infty}^{y} p_{X,Y}(s,t)\,dt\,ds$    (3.3-3)

In a similar manner, joint distributions and densities of N random variables can be defined. As in the
scalar case, the joint density function of N random variables must be nonnegative and its integral over the
entire space must equal 1.

A random N-vector is the same as N jointly random scalar variables, the only difference being in the
terminology.

3.3.2 Expectations and Moments

The expected value of a random vector $X$ is defined as in the scalar case:

$E\{X\} = \int_{-\infty}^{\infty} x\,p_X(x)\,dx$    (3.3-4)

The covariance of $X$ is a matrix defined by

$\operatorname{cov}(X) \equiv E\{[X - E\{X\}][X - E\{X\}]^*\}$

$= E\{XX^*\} - E\{X\}E\{X\}^*$    (3.3-5)

The covariance matrix is always symmetric and positive semi-definite. It is positive definite if $X$ has a
density function. Higher order moments of random vectors can be defined, but are notationally clumsy and
seldom used.

Consider a random vector $Y$ given by

$Y = AX + b$    (3.3-6)

where $A$ is any deterministic matrix (not necessarily square), and $b$ is an appropriate length deterministic
vector. Then the mean and covariance of $Y$ are

$E\{Y\} = E\{AX + b\} = AE\{X\} + b$    (3.3-7)

$\operatorname{cov}(Y) = E\{[Y - E\{Y\}][Y - E\{Y\}]^*\}$

$= E\{[AX + b - AE\{X\} - b][AX + b - AE\{X\} - b]^*\}$

$= A\,E\{[X - E\{X\}][X - E\{X\}]^*\}\,A^*$

$= A \operatorname{cov}(X) A^*$    (3.3-8)
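Equations (3.3-7) and (3.3-8) translate directly into two matrix products. The following sketch uses plain nested lists and a nonsquare A to emphasize that the result holds for rectangular matrices; the function names and the numerical values of `A`, `b`, `mean_x`, and `cov_x` are assumptions chosen for illustration.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]

def transform_mean_cov(A, b, mean_x, cov_x):
    """Mean and covariance of Y = A X + b from those of X."""
    mean_col = matmul(A, [[v] for v in mean_x])
    mean_y = [mean_col[i][0] + b[i] for i in range(len(b))]   # Equation (3.3-7)
    cov_y = matmul(matmul(A, cov_x), transpose(A))            # Equation (3.3-8)
    return mean_y, cov_y

# Hypothetical example: a 3-by-2 matrix A maps a 2-vector X to a 3-vector Y.
A = [[1, 2], [0, 1], [1, -1]]
b = [1, 0, -1]
mean_x = [1, 2]
cov_x = [[2, 1], [1, 3]]

mean_y, cov_y = transform_mean_cov(A, b, mean_x, cov_x)
```

Because A has more rows than columns here, the resulting 3-by-3 covariance is symmetric but singular, which is consistent with the statement that a covariance matrix is in general only positive semi-definite.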


3.3.3  Marginal  and  Conditional  Distributions 


If $X$ and $Y$ are jointly random variables with a joint distribution function given by Equation (3.3-1),
then $X$ and $Y$ are also individually random variables, with distribution functions defined as in
Equation (3.2-1). The individual distributions of $X$ and $Y$ are called the marginal distributions, and the
corresponding density functions are called marginal density functions.

The marginal distributions of $X$ and $Y$ can be derived from the joint distribution. (Note that the
converse is false without additional assumptions.) By comparing Equations (3.2-1) and (3.3-1), we obtain

$F_X(x) = F_{X,Y}(x, \infty)$    (3.3-9a)

and correspondingly

$F_Y(y) = F_{X,Y}(\infty, y)$    (3.3-9b)

In terms of the density functions, using Equations (3.2-2) and (3.3-3), we obtain

$p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x,y)\,dy$    (3.3-10a)

$p_Y(y) = \int_{-\infty}^{\infty} p_{X,Y}(x,y)\,dx$    (3.3-10b)

The conditional distribution function of $X$ given $Y$ is defined as (see Equation (3.1-1))

$F_{X|Y}(x|y) = P(\{\omega: X(\omega) \le x\} \mid \{\omega: Y(\omega) \le y\})$    (3.3-11)

and correspondingly for $F_{Y|X}$. The conditional density function, when it exists, can be expressed as

$p_{X|Y}(x|y) = p_{X,Y}(x,y)/p_Y(y)$    (3.3-12)

Equation (3.3-12) is known as Bayes' rule.

The conditional expectation is defined as

$E\{X|Y\} = \int_{-\infty}^{\infty} x\,p_{X|Y}(x|y)\,dx$    (3.3-13)

assuming that the density function exists. Using Equation (3.3-13), we obtain the useful decomposition

$E\{f(X,Y)\} = E\{E\{f(X,Y)|Y\}\}$    (3.3-14)


3.3.4  Statistical  Independence 


Two random vectors $X$ and $Y$ defined on the same probability space are defined to be independent if

$F_{X,Y}(x,y) = F_X(x)F_Y(y)$    (3.3-15)

If the joint probability density function exists, we can write this condition as

$p_{X,Y}(x,y) = p_X(x)p_Y(y)$    (3.3-16)

An immediate corollary, using Equation (3.3-12), is that $p_{X|Y}$ does not depend on $y$, and $p_{Y|X}$ does not
depend on $x$. If $X$ and $Y$ are independent, then $f(X)$ and $g(Y)$ are independent for any functions $f$ and $g$.

Two vectors are uncorrelated if

$E\{XY^*\} = E\{X\}E\{Y^*\}$    (3.3-17)

or equivalently if

$E\{(X - E\{X\})(Y - E\{Y\})^*\} = 0$    (3.3-18)

If $X$ and $Y$ are uncorrelated, then the covariance of their sum equals the sum of their covariances:

$\operatorname{cov}(X + Y) = \operatorname{cov}(X) + \operatorname{cov}(Y)$    (3.3-19)

If two vectors are independent, then they are uncorrelated, but the converse of this statement is false.
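A standard counterexample shows why the converse fails. The sketch below assumes a discrete scalar X taking the values -1, 0, and 1 with equal probability and sets Y = X squared; this example and the names used in it are illustrative and do not come from the text. The pair satisfies Equation (3.3-17) but violates the product condition for independence.

```python
from fractions import Fraction

# Assumed example: X is -1, 0, or 1 with equal probability, and Y = X**2.
outcomes = [(-1, 1), (0, 0), (1, 1)]
p = Fraction(1, 3)

def expect(f):
    """Expected value of f(X, Y) over the three equally likely outcomes."""
    return sum(p * f(x, y) for x, y in outcomes)

e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
e_xy = expect(lambda x, y: x * y)

# Uncorrelated: E{XY} = E{X} E{Y}, the scalar form of Equation (3.3-17).
uncorrelated = (e_xy == e_x * e_y)

# Independence would require P(X=0, Y=1) = P(X=0) P(Y=1).
p_joint = sum(p for x, y in outcomes if x == 0 and y == 1)
p_x0 = sum(p for x, y in outcomes if x == 0)
p_y1 = sum(p for x, y in outcomes if y == 1)
independent = (p_joint == p_x0 * p_y1)
```

Here Y is completely determined by X, so the two are as dependent as possible, yet their covariance is zero.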
3.4 TRANSFORMATION OF VARIABLES

A large part of probability theory is concerned in some manner with the transformation of variables; i.e.,
characterizing random variables defined as functions of other random variables. We have previously cited
limited results on the means and covariances of some transformed variables (Equations (3.2-5), (3.3-7), and
(3.3-8)). In this section we seek the entire density function. Our consideration is restricted to variables
that have density functions. Let $X$ be a random vector with density function $p_X(x)$ defined on $R^n$, the
Euclidean space of real n-vectors. Then define $Y \in R^m$ by $Y = f(X)$. We seek to derive the density
function of $Y$. There are three cases to consider, depending on whether $m = n$, $m > n$, or $m < n$.

The primary case of interest is when $m = n$. Assume that $f(\cdot)$ is invertible and has continuous partial
derivatives. (Technically, this is only required almost everywhere.) Define $g(Y) = f^{-1}(Y)$. Then

$p_Y(y) = p_X(g(y))\,|\det(J)|$    (3.4-1)

where $J$ is the Jacobian of the transformation $g$:

$J_{ij} = \frac{\partial g_i(y)}{\partial y_j}$

See Rudin (1974, p. 186) and Apostol (1969, p. 394) for the proof.

Example 3.4-1 Let $Y = CX$, with $C$ square and nonsingular. Then $g(y) = C^{-1}y$
and $J = C^{-1}$, giving

$p_Y(y) = p_X(C^{-1}y)\,|\det(C^{-1})|$    (3.4-2)

as the transformation equation.
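Equation (3.4-2) can be exercised on a case where the answer is easy to see geometrically. The sketch below assumes X is uniform on the unit square and uses a hypothetical 2-by-2 matrix C, so that Y = CX is uniform on a parallelogram of area |det C| with density 1/|det C|. All names and numerical values are illustrative.

```python
def inverse_2x2(c):
    """Return (inverse of c, determinant of the inverse) for a 2-by-2 matrix."""
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    if det == 0:
        raise ValueError("C must be nonsingular")
    inv = [[c[1][1] / det, -c[0][1] / det],
           [-c[1][0] / det, c[0][0] / det]]
    return inv, 1.0 / det

def density_x(x):
    """Assumed density of X: uniform on the open unit square."""
    return 1.0 if 0.0 < x[0] < 1.0 and 0.0 < x[1] < 1.0 else 0.0

def density_y(y, c):
    """Density of Y = C X from Equation (3.4-2)."""
    inv, det_inv = inverse_2x2(c)
    x = (inv[0][0] * y[0] + inv[0][1] * y[1],
         inv[1][0] * y[0] + inv[1][1] * y[1])
    return density_x(x) * abs(det_inv)

def total_probability(c, lo=0.0, hi=2.0, n=400):
    """Midpoint-rule integral of density_y over the square [lo, hi] x [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += density_y((lo + (i + 0.5) * h, lo + (j + 0.5) * h), c)
    return total * h * h

C = [[2.0, 0.0], [1.0, 1.0]]           # hypothetical matrix with det(C) = 2
inside = density_y((1.0, 1.0), C)       # image of x = (0.5, 0.5)
outside = density_y((3.0, 0.0), C)      # not the image of any point in the square
total = total_probability(C)            # should be close to 1
```

The factor |det(C^{-1})| is what keeps the transformed density normalized: the support doubles in area, so the density halves.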

If $f$ is not invertible, the distribution of $Y$ is given by a sum of terms similar to Equation (3.4-1).

For the case with $m > n$, the distribution of $Y$ will be concentrated on, at most, an n-dimensional
hypersurface in $R^m$, and will not have a density function in $R^m$.

The simplest nontrivial case of $m < n$ is when $Y$ consists of a subset of the elements of $X$. In this
case, the density function sought is the density function of the marginal distribution of the pertinent subset
of the elements of $X$. Marginal distributions were discussed in Section 3.3.3. In general, when $m < n$,
$X$ can be transformed into a random vector $Z \in R^n$, such that $Y$ is a subset of the elements of $Z$.


Example 3.4-2 Let $X \in R^2$ and $Y = X_1 + X_2$. Define $Z = CX$ where


Then using Example 3.4-1,

$p_Z(z) = p_X(C^{-1}z)\,|\det(C^{-1})|$

$= \tfrac{1}{2}\,p_X(C^{-1}z)$


where 


Then $Y = Z_1$, so the distribution of $Y$ is the marginal distribution of $Z_1$, which can be computed from
Equation (3.3-10).


3.5  GAUSSIAN  VARIABLES 

Random variables with Gaussian distributions play a major role in this document and in much of probability
theory. We will, therefore, briefly review the definition and some of the salient properties of Gaussian
distributions. These distributions are often called normal distributions in the literature.


3.5.1  Standard  Gaussian  Distributions 


All Gaussian distributions derive from the distribution of a standard Gaussian variable with mean 0 and
covariance 1. The density function of the standard Gaussian distribution is defined to be

$p(x) = (2\pi)^{-1/2} \exp\left(-\tfrac{1}{2} x^2\right)$    (3.5-1)

The distribution function does not have a simple closed-form expression. We will first show that
Equation (3.5-1) is a valid density function with mean 0 and covariance 1. The most difficult part is showing
that its integral over the real line is 1.

Theorem 3.5-1 Equation (3.5-1) defines a valid probability density function.

Proof The function is obviously nonnegative. There remains only to show
that its integral over the real line is 1. Taking advantage of the symmetry
about 0, we can reduce this problem to proving that

$\int_0^\infty \exp\left(-\tfrac{1}{2} x^2\right) dx = \sqrt{\pi/2}$    (3.5-2)

There is no closed-form expression for this integral over any finite range,
but for the semi-infinite range of Equation (3.5-2) the following "trick"
works. Form the square of the integral:

$\left[\int_0^\infty \exp\left(-\tfrac{1}{2} x^2\right) dx\right]^2 = \int_0^\infty \int_0^\infty \exp\left[-\tfrac{1}{2}(x^2 + y^2)\right] dx\,dy$    (3.5-3)

Then change variables to polar coordinates, substituting $r^2$ for $x^2 + y^2$
and $r\,dr\,d\theta$ for $dx\,dy$, to get

$\left[\int_0^\infty \exp\left(-\tfrac{1}{2} x^2\right) dx\right]^2 = \int_0^{\pi/2} \int_0^\infty r \exp\left(-\tfrac{1}{2} r^2\right) dr\,d\theta$    (3.5-4)

The inner integral in Equation (3.5-4) has a closed-form solution:

$\int_0^\infty r \exp\left(-\tfrac{1}{2} r^2\right) dr = -\exp\left(-\tfrac{1}{2} r^2\right)\Big|_0^\infty = 0 - (-1) = 1$    (3.5-5)

Thus,

$\left[\int_0^\infty \exp\left(-\tfrac{1}{2} x^2\right) dx\right]^2 = \int_0^{\pi/2} 1\,d\theta = \frac{\pi}{2}$    (3.5-6)

Taking the square root gives Equation (3.5-2), completing the proof.
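The value claimed in Equation (3.5-2) is easy to confirm numerically. The sketch below assumes a midpoint rule truncated at an upper limit of 10, beyond which the integrand is negligible; the function names and the truncation limit are illustrative choices.

```python
import math

def half_line_integral(upper=10.0, n=100000):
    """Midpoint-rule estimate of the integral of exp(-x^2/2) over [0, upper]."""
    h = upper / n
    return h * sum(math.exp(-0.5 * ((i + 0.5) * h) ** 2) for i in range(n))

def radial_integral(upper=10.0, n=100000):
    """Midpoint-rule estimate of the integral of r exp(-r^2/2) over [0, upper]."""
    h = upper / n
    return h * sum((i + 0.5) * h * math.exp(-0.5 * ((i + 0.5) * h) ** 2)
                   for i in range(n))

estimate = half_line_integral()
target = math.sqrt(math.pi / 2.0)     # right-hand side of Equation (3.5-2)
inner = radial_integral()             # Equation (3.5-5) says this is 1

# Normalization of the density (3.5-1): twice the half-line integral over sqrt(2 pi).
normalization = 2.0 * estimate / math.sqrt(2.0 * math.pi)
```

The numerical check is no substitute for the proof, but it guards against sign or factor-of-two slips when the result is reused.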

The mean of the distribution is trivially zero by symmetry. To derive the covariance, note that

$E\{1 - X^2\} = \int_{-\infty}^{\infty} (1 - x^2)(2\pi)^{-1/2} \exp\left(-\tfrac{1}{2} x^2\right) dx = (2\pi)^{-1/2}\,x \exp\left(-\tfrac{1}{2} x^2\right)\Big|_{-\infty}^{\infty} = 0$    (3.5-9)

Thus,

$\operatorname{cov}(X) = E\{X^2\} - E\{X\}^2 = 1 - 0 = 1$    (3.5-10)

This completes our discussion of the scalar standard Gaussian.


We define a standard multivariate Gaussian vector to be the concatenation of n independent standard
Gaussian variables. The standard multivariate Gaussian density function is therefore the product of n
marginal density functions in the form of Equation (3.5-1):

$p(x) = \prod_{i=1}^{n} (2\pi)^{-1/2} \exp\left(-\tfrac{1}{2} x_i^2\right) = (2\pi)^{-n/2} \exp\left(-\tfrac{1}{2} x^* x\right)$    (3.5-11)

The mean of this distribution is 0 and the covariance is an identity matrix.


3.5.2  General  Gaussian  Distributions 


We will define the class of all Gaussian distributions by reference to the standard Gaussian distributions
of the previous section. We define a random vector $Y$ to have a Gaussian distribution if $Y$ can be
represented in the form

$Y = AX + m$    (3.5-12)

where $X$ is a standard Gaussian vector, $A$ is a deterministic matrix, and $m$ is a deterministic vector. The
$A$ matrix need not be square. Note that any deterministic vector is a special case of a Gaussian vector with
a zero $A$ matrix.

We have defined the class of Gaussian random variables by a set of operations that can produce such
variables. It now remains to determine the forms and properties of these distributions. (This is somewhat
backwards from the most common approach, where the forms of the distributions are first defined and
Equation (3.5-12) is proven as a result. We find that our approach makes it somewhat easier to handle singular
and nonsingular cases consistently without introducing characteristic functions (Papoulis, 1965).)

By Equations (3.3-7) and (3.3-8), the $Y$ defined by Equation (3.5-12) has mean $m$ and covariance $AA^*$.
Our first major result will be to show that a Gaussian distribution is uniquely specified by its mean and
covariance; that is, if two distributions are both Gaussian and have equal means and covariances, then the two
distributions are identical. Note that this does not mean that the $A$ matrices need to be identical; the
reason the result is nontrivial is that an infinite number of different $A$ matrices give the same covariance
$AA^*$.


Example  3.5-1  Consider  three  Gaussian  vectors 

.707 

_  -707 
where $X_1$ and $X_2$ are standard Gaussian 2-vectors and $X_3$ is a standard
Gaussian 3-vector. We have


cov(Yj) 


cov(Yt) 


cov(Y,) 
Thus all three $Y_i$ have equal covariance.
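The point that different A matrices can give the same covariance AA* is easy to reproduce. The sketch below uses three hypothetical matrices chosen for illustration (an identity, a 45-degree rotation, and a 2-by-3 matrix); they are assumptions and are not claimed to be the matrices of Example 3.5-1.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]

def covariance_from_a(a):
    """Covariance A A* of Y = A X + m when X is a standard Gaussian vector."""
    return matmul(a, transpose(a))

c = math.sqrt(0.5)                      # approximately 0.707

# Hypothetical A matrices (illustrative only).
A1 = [[1.0, 0.0], [0.0, 1.0]]           # acts on a standard Gaussian 2-vector
A2 = [[c, c], [-c, c]]                  # rotation of a standard Gaussian 2-vector
A3 = [[1.0, 0.0, 0.0], [0.0, c, c]]     # acts on a standard Gaussian 3-vector

covariances = [covariance_from_a(a) for a in (A1, A2, A3)]
```

All three products equal the 2-by-2 identity to within rounding, even though the matrices differ in both values and shape.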

The rest of this section is devoted to proving this result in three steps. First, we will consider
square, nonsingular $A$ matrices. Second, we will consider general square $A$ matrices. Finally, we will
consider nonsquare $A$ matrices. Each of these steps uses the results of the previous step.


Theorem 3.5-2 If $Y$ is a Gaussian n-vector defined by Equation (3.5-12)
with a nonsingular $A$ matrix, then the probability density function of $Y$
exists and is given by

$p(y) = |2\pi\Lambda|^{-1/2} \exp\left[-\tfrac{1}{2}(y - m)^* \Lambda^{-1} (y - m)\right]$    (3.5-13)

where $\Lambda$ is the covariance $AA^*$ and $|2\pi\Lambda|$ denotes the determinant of $2\pi\Lambda$.

Proof This is a direct application of the transformation of variables
formula, Equation (3.4-1).

$p_Y(y) = p_X[A^{-1}(y - m)]\,|\det(A^{-1})|$

$= (2\pi)^{-n/2} \exp\left\{-\tfrac{1}{2}[A^{-1}(y - m)]^*[A^{-1}(y - m)]\right\} |\det(A^{-1})|$

$= |2\pi AA^*|^{-1/2} \exp\left[-\tfrac{1}{2}(y - m)^*(AA^*)^{-1}(y - m)\right]$

Substituting $\Lambda$ for $AA^*$ then gives the desired result.

Note that the density function, Equation (3.5-13), depends only on the mean and covariance, thus proving
the uniqueness result for the case restricted to nonsingular matrices. A particular case of interest is where
$m$ is 0 and $A$ is unitary. (A unitary matrix is a square one with $AA^* = I$.) In this case, $Y$ has a standard
Gaussian distribution.


Theorem 3.5-3 If $Y$ is a Gaussian n-vector defined by Equation (3.5-12)
with any square $A$ matrix, then $Y$ can be represented as

$Y = S\tilde{X} + m$    (3.5-14)

where $\tilde{X}$ is a standard Gaussian n-vector and $S$ is positive semi-definite.
Furthermore, the $S$ in this representation is unique and depends only on
the covariance of $Y$.

Proof The uniqueness is easy to prove, and we will do it first. The covariance
of the $Y$ given in Equation (3.5-12) is $AA^*$. The covariance of a $Y$
expressed as in Equation (3.5-14) is $SS^*$. A necessary (but not sufficient)
condition for Equation (3.5-14) to be a valid representation of $Y$ is,
therefore, that $SS^*$ equal $AA^*$. It is an elementary result of linear algebra
(Wilkinson, 1965; Dongarra, Moler, Bunch, and Stewart, 1979; and Strang,
1980) that $AA^*$ is always positive semi-definite and that there is one and
only one positive semi-definite matrix $S$ satisfying $SS^* = AA^*$. $S$ is
called the matrix square root of $AA^*$. This proves the uniqueness.

The existence proof relies on another result from linear algebra: any square
matrix $A$ can be factored as $SQ$, where $S$ is positive semi-definite and $Q$
is unitary. For nonsingular $A$, this factorization is easy: $S$ is the matrix
square root of $AA^*$ and $Q$ is $S^{-1}A$. A formal proof for general $A$ matrices
would be too long a diversion into linear algebra for our current purposes,
so we will omit it. This factorization is closely related to, and can be
formally derived from, the well-known QR factorization, where $Q$ is unitary
and $R$ is upper triangular (Wilkinson, 1965; Dongarra, Moler, Bunch, and
Stewart, 1979; and Strang, 1980).

Given the SQ factorization of $A$, define

$\tilde{X} = QX$    (3.5-15)

By Theorem (3.5-2), $\tilde{X}$ is a standard Gaussian n-vector. Substituting into
Equation (3.5-12) gives Equation (3.5-14), completing the proof.
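For a nonsingular 2-by-2 matrix, the SQ factorization used in the proof can be computed in closed form. The sketch below assumes the well-known 2-by-2 identity S = (M + sqrt(det M) I) / sqrt(trace M + 2 sqrt(det M)) for the square root of a symmetric positive definite M; this shortcut applies only to the 2-by-2 case and is a stand-in for the general matrix square root, not the method of the text. The matrix `A` is a hypothetical example.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]

def inverse_2x2(m):
    """Inverse of a nonsingular 2-by-2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def sqrt_pd_2x2(m):
    """Symmetric square root of a symmetric positive definite 2-by-2 matrix."""
    s = math.sqrt(m[0][0] * m[1][1] - m[0][1] * m[1][0])
    t = math.sqrt(m[0][0] + m[1][1] + 2.0 * s)
    return [[(m[0][0] + s) / t, m[0][1] / t],
            [m[1][0] / t, (m[1][1] + s) / t]]

A = [[1.0, 1.0], [0.0, 1.0]]            # hypothetical nonsingular A
cov = matmul(A, transpose(A))           # A A*
S = sqrt_pd_2x2(cov)                    # matrix square root of A A*
Q = matmul(inverse_2x2(S), A)           # Q = S^{-1} A, so that A = S Q
```

In this example S works out to [[3, 1], [1, 2]] divided by the square root of 5, and Q is a rotation, so QX has the same standard Gaussian distribution as X.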

Because the $S$ in the above theorem depends only on the covariance of $Y$, it immediately follows that
the distribution of any Gaussian variable generated by a square $A$ matrix is uniquely specified by the mean
and covariance. It remains only to extend this result to rectangular $A$ matrices.

Theorem 3.5-4 The distribution of any Gaussian vector is uniquely defined
by its mean and covariance.

Proof We have already shown the result for Gaussian vectors generated by
square $A$ matrices. We need only show that a Gaussian vector generated by
a rectangular $A$ matrix can be rewritten in terms of a square $A$ matrix.

Let $A$ be n-by-m, and consider the two cases, $n > m$ and $n < m$. If $n > m$,
define a standard Gaussian n-vector $\tilde{X}$ by augmenting the $X$ vector with
$n - m$ independent standard Gaussians. Then define an n-by-n matrix $\tilde{A}$ by
augmenting $A$ with $n - m$ columns of zeros. We then have

$Y = \tilde{A}\tilde{X} + m$

as desired.

For the case $n < m$, define a random m-vector $\tilde{Y}$ by augmenting $Y$ with
$m - n$ zeros. Then

$\tilde{Y} = \tilde{A}X + \tilde{m}$

where $\tilde{m}$ and $\tilde{A}$ are obtained by augmenting zeros to $m$ and $A$. Use
Theorem (3.5-3) to rewrite $\tilde{Y}$ as

$\tilde{Y} = S\tilde{X} + \tilde{m}$    (3.5-16)

Since the last $m - n$ elements of $\tilde{Y}$ are zero, Equation (3.5-16) must be
in the form

$\begin{bmatrix} Y \\ 0 \end{bmatrix} = \begin{bmatrix} S_{11} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \tilde{X}_1 \\ \tilde{X}_2 \end{bmatrix} + \begin{bmatrix} m \\ 0 \end{bmatrix}$

Thus

$Y = S_{11}\tilde{X}_1 + m$

which is in the required form.


Theorem (3.5-4) is the central result of this approach to Gaussian variables. It makes the practical
manipulation of Gaussian variables much easier. Once you have demonstrated that some result is Gaussian, you
need only derive the mean and covariance to specify the distribution completely. This is far easier than
manipulating the full density function or distribution function, a process which often requires partial
differential equations. If the covariance matrix is nonsingular, then the density function exists and is given by
Equation (3.5-13). If the covariance is singular, a density function does not exist (unless you extend the
definition of density functions to include components like impulse functions).

Two properties of the Gaussian density function often provide useful computational shortcuts to evaluating
the mean and covariance of nonsingular Gaussians. The first property is that the mean of the density function
occurs at its maximum. The mean is thus the unique solution of

$\nabla_y \ln p(y) = 0$    (3.5-17)

The logarithm in this equation can be removed, but the equation is usually most useful as written. The second
property is that the covariance can be expressed as

$\operatorname{cov}(Y) = -[\nabla_y^2 \ln p(y)]^{-1}$    (3.5-18)

Both of these properties are easy to verify by direct substitution into Equation (3.5-13).
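Both shortcuts can be checked numerically in the scalar case, where the gradient and second gradient reduce to ordinary derivatives. The sketch below assumes a scalar Gaussian with hypothetical mean 1.5 and variance 4 and uses central finite differences; the function names and step size are illustrative.

```python
import math

def log_gaussian(y, m, var):
    """Logarithm of the scalar form of the density in Equation (3.5-13)."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (y - m) ** 2 / var

def first_derivative(f, y, h=1e-3):
    """Central-difference estimate of the first derivative of f at y."""
    return (f(y + h) - f(y - h)) / (2.0 * h)

def second_derivative(f, y, h=1e-3):
    """Central-difference estimate of the second derivative of f at y."""
    return (f(y + h) - 2.0 * f(y) + f(y - h)) / (h * h)

m, var = 1.5, 4.0                       # hypothetical mean and variance
log_p = lambda y: log_gaussian(y, m, var)

# The gradient of ln p vanishes at the mean (scalar form of the first property).
gradient_at_mean = first_derivative(log_p, m)

# The covariance is minus the inverse second derivative of ln p,
# which is the same at every y because ln p is quadratic (second property).
cov_estimate = -1.0 / second_derivative(log_p, 0.3)
```

Because the log density is exactly quadratic, the second derivative can be evaluated at any point, not only at the mean.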

3.5.3  Properties 


In this section we derive several useful properties of Gaussian vectors. Most of these properties relate
to operations on Gaussian vectors that give Gaussian results. A major reason for the wide use of Gaussian
distributions is that many basic operations on Gaussian vectors give Gaussian results, which can be
characterized completely by the mean and covariance.

Theorem 3.5-5 If $Y$ is a Gaussian vector with mean $m$ and covariance $\Lambda$,
and if $Z$ is given by

$Z = BY + b$

then $Z$ is Gaussian with mean $Bm + b$ and covariance $B\Lambda B^*$.

Proof By definition, $Y$ can be expressed as

$Y = AX + m$

where $X$ is a standard Gaussian. Substituting $Y$ into the expression for
$Z$ gives

$Z = B(AX + m) + b = BAX + (Bm + b)$

proving that $Z$ is Gaussian. The mean and covariance expressions for linear
operations on any random vector were previously derived in Equations (3.3-7)
and (3.3-8).

Several of the properties discussed in this section involve the concept of jointly Gaussian variables.
Two or more random vectors are said to be jointly Gaussian if their joint distribution is Gaussian. Note that
two vectors can both be Gaussian and yet not be jointly Gaussian.

Example 3.5-2 Let $Y$ be a Gaussian random variable with mean 0 and
variance 1. Define $Z$ as


$Z$


$-1 \le Y \le 1$
elsewhere


The random variable $Z$ is Gaussian with mean 0 and variance 1 (apply Equation (3.4-1) to show this),
but $Y$ and $Z$ are not jointly Gaussian.


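Theorem 3.5-6 Let $Y_1$ and $Y_2$ be jointly Gaussian vectors, and let the mean
and covariance of the joint distribution be partitioned as

$m = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} \qquad \Lambda = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix}$

Then the marginal distributions of $Y_1$ and $Y_2$ are Gaussian with

$E\{Y_1\} = m_1 \qquad \operatorname{cov}(Y_1) = \Lambda_{11}$

$E\{Y_2\} = m_2 \qquad \operatorname{cov}(Y_2) = \Lambda_{22}$

Proof Apply Theorem (3.5-5) with $B = [I \ \ 0]$ and $B = [0 \ \ I]$.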

The following two theorems relate to independent Gaussian variables:

Theorem 3.5-7 If $Y$ and $Z$ are two independent Gaussian variables, then $Y$ and $Z$ are jointly
Gaussian.

Proof For nonsingular distributions, this proof is easy to do by writing out
the product of the density functions. For a more general proof, we can
proceed as follows: write $Y$ and $Z$ as

$Y = A_1 X_1 + m_1$

$Z = A_2 X_2 + m_2$

where $X_1$ and $X_2$ are standard Gaussian vectors. We can always construct the
$X_1$ and $X_2$ in these equations to be independent, but the following argument
avoids the necessity to prove that statement. Define two independent standard
Gaussians, $\tilde{X}_1$ and $\tilde{X}_2$, and further define

$\tilde{Y} = A_1 \tilde{X}_1 + m_1$

$\tilde{Z} = A_2 \tilde{X}_2 + m_2$

Then $\tilde{Y}$ and $\tilde{Z}$ have the same joint distribution as $Y$ and $Z$. The concatenation
of $\tilde{X}_1$ and $\tilde{X}_2$ is a standard Gaussian vector. Therefore, $\tilde{Y}$ and $\tilde{Z}$ are jointly
Gaussian because they can be expressed as

$\begin{bmatrix} \tilde{Y} \\ \tilde{Z} \end{bmatrix} = \begin{bmatrix} A_1 & 0 \\ 0 & A_2 \end{bmatrix} \begin{bmatrix} \tilde{X}_1 \\ \tilde{X}_2 \end{bmatrix} + \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}$

Since $Y$ and $Z$ have the same joint distribution as $\tilde{Y}$ and $\tilde{Z}$, $Y$ and $Z$ are also
jointly Gaussian.

Theorem 3.5-8 If $Y$ and $Z$ are two uncorrelated jointly Gaussian variables,
then $Y$ and $Z$ are independent and Gaussian.

Proof By Theorem (3.5-3), we can express

$\begin{bmatrix} Y \\ Z \end{bmatrix} = SX + m$

where $X$ is a standard Gaussian vector and $S$ is positive semi-definite.
Partition $S$ as

$S = \begin{bmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{bmatrix}$

By the definition of "uncorrelated," we must have $S_{12} = S_{21} = 0$. Therefore,
partitioning $X$ into $X_1$ and $X_2$, and partitioning $m$ into $m_1$ and $m_2$, we
can write

$Y = S_{11}X_1 + m_1$

$Z = S_{22}X_2 + m_2$

Since $Y$ and $Z$ are functions of the independent vectors $X_1$ and $X_2$, $Y$ and $Z$
are independent and Gaussian.

Since any two independent vectors are uncorrelated, Theorem (3.5-8) proves that independence and lack of
correlation are equivalent for Gaussians.

We previously covered marginal distributions of Gaussian vectors. The following theorem considers
conditional distributions. We will directly consider only conditional distributions of nonsingular Gaussians.
Since the results of the theorem involve inverses, there are obvious difficulties that cannot be circumvented
by avoiding the use of probability density functions in the proof.

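Theorem 3.5-9 Let $Y_1$ and $Y_2$ be jointly Gaussian variables with a nonsingular
joint distribution. Partition the mean, covariance, and inverse covariance
of the joint distribution as

$m = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} \qquad \Lambda = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix} \qquad \Gamma = \Lambda^{-1} = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix}$

Then the conditional distributions of $Y_1$ given $Y_2$, and of $Y_2$ given $Y_1$, are
Gaussian with means and covariances

$E\{Y_1|Y_2\} = m_1 + \Lambda_{12}\Lambda_{22}^{-1}(y_2 - m_2)$    (3.5-18a)

$\operatorname{cov}\{Y_1|Y_2\} = \Lambda_{11} - \Lambda_{12}\Lambda_{22}^{-1}\Lambda_{21} = (\Gamma_{11})^{-1}$    (3.5-18b)

$E\{Y_2|Y_1\} = m_2 + \Lambda_{21}\Lambda_{11}^{-1}(y_1 - m_1)$    (3.5-19a)

$\operatorname{cov}\{Y_2|Y_1\} = \Lambda_{22} - \Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12} = (\Gamma_{22})^{-1}$    (3.5-19b)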
Proof The joint probability density function of $Y_1$ and $Y_2$ is

$p(y_1, y_2) = c \exp\left\{-\tfrac{1}{2} \begin{bmatrix} y_1 - m_1 \\ y_2 - m_2 \end{bmatrix}^* \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix} \begin{bmatrix} y_1 - m_1 \\ y_2 - m_2 \end{bmatrix}\right\}$    (3.5-20)

where $c$ is a scalar constant, the magnitude of which we will not need to
compute. Expanding the exponent, and recognizing that $\Gamma_{21} = \Gamma_{12}^*$, gives

$p(y_1, y_2) = c \exp\left[-\tfrac{1}{2}(y_1 - m_1)^*\Gamma_{11}(y_1 - m_1) - \tfrac{1}{2}(y_2 - m_2)^*\Gamma_{22}(y_2 - m_2) - (y_1 - m_1)^*\Gamma_{12}(y_2 - m_2)\right]$

Completing squares results in

$p(y_1, y_2) = c \exp\left\{-\tfrac{1}{2}[y_1 - m_1 + \Gamma_{11}^{-1}\Gamma_{12}(y_2 - m_2)]^*\Gamma_{11}[y_1 - m_1 + \Gamma_{11}^{-1}\Gamma_{12}(y_2 - m_2)] - \tfrac{1}{2}(y_2 - m_2)^*(\Gamma_{22} - \Gamma_{21}\Gamma_{11}^{-1}\Gamma_{12})(y_2 - m_2)\right\}$    (3.5-21)

Integrating this expression with respect to $y_1$ gives the marginal density
function of $Y_2$. The second term in the exponent does not involve $y_1$, and
we recognize the first term as the exponent in a Gaussian density function
with mean $m_1 - \Gamma_{11}^{-1}\Gamma_{12}(y_2 - m_2)$ and covariance $\Gamma_{11}^{-1}$. Its integral with
respect to $y_1$ is therefore a constant independent of $y_2$. The marginal
density function of $Y_2$ is therefore

$p(y_2) = c_2 \exp\left[-\tfrac{1}{2}(y_2 - m_2)^*(\Gamma_{22} - \Gamma_{21}\Gamma_{11}^{-1}\Gamma_{12})(y_2 - m_2)\right]$    (3.5-22)

where $c_2$ is a constant. Note that because we know that Equation (3.5-22)
must be a probability density function, we need not compute the value of $c_2$;
this saves us a lot of work. Equation (3.5-22) is an expression for a
Gaussian density function with mean $m_2$ and covariance $(\Gamma_{22} - \Gamma_{21}\Gamma_{11}^{-1}\Gamma_{12})^{-1}$.

The partitioned matrix inversion lemma (Appendix A) gives us

$(\Gamma_{22} - \Gamma_{21}\Gamma_{11}^{-1}\Gamma_{12})^{-1} = \Lambda_{22}$

thus independently verifying the result of Theorem (3.5-6) on the marginal
distribution.

The conditional density of $Y_1$ given $Y_2$ is obtained using Bayes' rule, by
dividing Equation (3.5-21) by Equation (3.5-22):

$p(y_1|y_2) = \frac{p(y_1, y_2)}{p(y_2)} = c_3 \exp\left\{-\tfrac{1}{2}[y_1 - m_1 + \Gamma_{11}^{-1}\Gamma_{12}(y_2 - m_2)]^*\Gamma_{11}[y_1 - m_1 + \Gamma_{11}^{-1}\Gamma_{12}(y_2 - m_2)]\right\}$    (3.5-23)

where $c_3$ is a constant. This is an expression for a Gaussian density
function with a mean $m_1 - \Gamma_{11}^{-1}\Gamma_{12}(y_2 - m_2)$ and covariance $\Gamma_{11}^{-1}$. The
partitioned matrix inversion lemma (Appendix A) then gives

$\Gamma_{11}^{-1} = \Lambda_{11} - \Lambda_{12}\Lambda_{22}^{-1}\Lambda_{21}$

$\Gamma_{11}^{-1}\Gamma_{12} = -\Lambda_{12}\Lambda_{22}^{-1}$

Thus the conditional distribution of $Y_1$ given $Y_2$ is Gaussian with mean
$m_1 + \Lambda_{12}\Lambda_{22}^{-1}(y_2 - m_2)$ and covariance $\Lambda_{11} - \Lambda_{12}\Lambda_{22}^{-1}\Lambda_{21}$, as we desired to prove.

The conditional distribution of $Y_2$ given $Y_1$ follows by symmetry.
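The conditional mean and covariance formulas of Theorem 3.5-9 can be checked exactly when both partitions are scalars, so that every block is a number. The sketch below assumes a hypothetical joint covariance and mean and uses rational arithmetic; the function names and values are illustrative only.

```python
from fractions import Fraction as F

def conditional_gaussian(m1, m2, l11, l12, l22, y2):
    """Mean and variance of Y1 given Y2 = y2 for scalar blocks (l21 equals l12)."""
    gain = l12 / l22
    mean = m1 + gain * (y2 - m2)        # scalar form of Equation (3.5-18a)
    var = l11 - gain * l12              # scalar form of Equation (3.5-18b)
    return mean, var

def inverse_cov_2x2(l11, l12, l22):
    """Elements (g11, g12, g22) of the inverse of [[l11, l12], [l12, l22]]."""
    det = l11 * l22 - l12 * l12
    return l22 / det, -l12 / det, l11 / det

# Hypothetical joint mean and covariance.
m1, m2 = F(1), F(-1)
l11, l12, l22 = F(4), F(2), F(3)

cond_mean, cond_var = conditional_gaussian(m1, m2, l11, l12, l22, y2=F(2))
g11, g12, g22 = inverse_cov_2x2(l11, l12, l22)

# Cross-checks against the inverse-covariance expressions used in the proof.
var_from_gamma = 1 / g11                # should equal cond_var
gain_from_gamma = -g12 / g11            # should equal l12 / l22
```

Observing Y2 reduces the variance of Y1 from 4 to 8/3 in this example; the reduction depends only on the covariance, not on the observed value.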

The final result of this section concerns sums of Gaussian variables.

Theorem 3.5-10 If $Y_1$ and $Y_2$ are jointly Gaussian random vectors of equal
length and their joint distribution has mean and covariance partitioned as

$m = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} \qquad \Lambda = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix}$

then $Y_1 + Y_2$ is Gaussian with mean $m_1 + m_2$ and covariance
$\Lambda_{11} + \Lambda_{22} + \Lambda_{12} + \Lambda_{21}$.

Proof Apply Theorem (3.5-5) with $B = [I \ \ I]$ and $b = 0$.

A simple summary of this section is that linear operations on Gaussian variables give Gaussian results. This
principle is not generally true for nonlinear operations. Therefore, Gaussian distributions are strongly
associated with the analysis of linear systems.


3.5.4  Central  Limit  Theorem 


The Central Limit Theorem is often used as a basis for justifying the assumption that the distribution
of some physical quantity is approximately Gaussian.

Theorem 3.5-11 Let $Y_1, Y_2, \ldots$ be a sequence of independent, identically distributed random
vectors with finite mean $m$ and covariance $\Lambda$. Then the vectors

$Z_N = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} (Y_i - m)$

converge in distribution to a Gaussian vector with mean zero and covariance $\Lambda$.

Proof See Ash (1970, p. 171) and Apostol (1969, p. 567).

Cramer (1946) discusses several variants on this theorem, where the $Y_i$ need not be independent and
identically distributed, but other requirements are placed on the distributions. The general result is that sums
of random variables tend to Gaussian limits under fairly broad conditions. The precise conditions will not
concern us here. An implication of this theorem is that macroscopic behavior which is the result of the
summation of a large number of microscopic events often has a Gaussian distribution. The classic example is
Brownian motion. We will illustrate the Central Limit Theorem with a simple example.

Example 3.5-3 Let the distribution of the $Y_i$ in Theorem (3.5-11) be uniform
on the interval (-1,1). Then the mean is zero and the covariance is 1/3.
Examine the density functions of the first few $Z_N$.

The first function, $Z_1$, is equal to $Y_1$, and thus is uniform on (-1,1).
Figure (3.5-1) compares the densities of $Z_1$ and the Gaussian limit. The
Gaussian limit distribution has mean zero and variance 1/3.

For the second function we have

$$Z_2 = \frac{1}{\sqrt{2}}\,(Y_1 + Y_2)$$

and the density function of $Z_2$ is given by

$$p_{Z_2}(z) = \tfrac{1}{2}\left(\sqrt{2} - |z|\right) \quad \text{for } |z| < \sqrt{2}$$

Figure (3.5-2) compares the density of $Z_2$ with the Gaussian limit.
The density function of $Z_3$ is given by

$$p_{Z_3}(z) = \begin{cases}
\dfrac{3\sqrt{3}}{8}\,(1 - z^2) & |z| \le \dfrac{1}{\sqrt{3}} \\[2ex]
\dfrac{3\sqrt{3}}{16}\,\left(z^2 - 2\sqrt{3}\,|z| + 3\right) & \dfrac{1}{\sqrt{3}} \le |z| \le \sqrt{3}
\end{cases}$$

Figure (3.5-3) compares the density of $Z_3$ with the Gaussian limit. By the time
$N$ is 3, $Z_N$ is already becoming reasonably close to Gaussian.
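The closed-form densities in this example can be checked numerically. The following is a minimal sketch, not part of the original text; the function names, the integration range, and the grid sizes are our own assumptions. It integrates each density with a midpoint rule to confirm unit area and variance 1/3, and measures the largest pointwise gap from the limit Gaussian.

```python
import math

def density_z1(z):
    """Density of Z_1 = Y_1, uniform on (-1, 1)."""
    return 0.5 if abs(z) < 1.0 else 0.0

def density_z2(z):
    """Density of Z_2 = (Y_1 + Y_2)/sqrt(2)."""
    r2 = math.sqrt(2.0)
    return 0.5 * (r2 - abs(z)) if abs(z) < r2 else 0.0

def density_z3(z):
    """Density of Z_3 = (Y_1 + Y_2 + Y_3)/sqrt(3)."""
    r3 = math.sqrt(3.0)
    a = abs(z)
    if a <= 1.0 / r3:
        return (3.0 * r3 / 8.0) * (1.0 - a * a)
    if a <= r3:
        return (3.0 * r3 / 16.0) * (a * a - 2.0 * r3 * a + 3.0)
    return 0.0

def gaussian_limit(z, variance=1.0 / 3.0):
    """Density of the limit Gaussian (mean zero, variance 1/3)."""
    return math.exp(-z * z / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def moment(density, power, lo=-2.0, hi=2.0, n=40000):
    """Midpoint-rule approximation of the integral of z**power * density(z)."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        z = lo + (k + 0.5) * h
        total += z ** power * density(z)
    return total * h

def max_gap(density, lo=-2.0, hi=2.0, n=4000):
    """Largest pointwise difference from the limit Gaussian on a uniform grid."""
    h = (hi - lo) / n
    return max(abs(density(lo + k * h) - gaussian_limit(lo + k * h)) for k in range(n + 1))

if __name__ == "__main__":
    for name, density in (("Z1", density_z1), ("Z2", density_z2), ("Z3", density_z3)):
        print(name, moment(density, 0), moment(density, 2), max_gap(density))
```

All three densities should integrate to one with variance 1/3, and the gap from the Gaussian should shrink as N goes from 1 to 3, which is the behavior Figures (3.5-1) to (3.5-3) illustrate.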


Figure (3.5-1). Density functions of $Z_1$ and the limit Gaussian.


Figure (3.5-2). Density functions of $Z_2$ and the limit Gaussian.


Figure (3.5-3). Density functions of $Z_3$ and the limit Gaussian.


CHAPTER 4


4.0  STATISTICAL  ESTIMATORS 

In this chapter, we introduce the concept of an estimator. We then define some basic measures of
estimator performance. We use these measures of performance to introduce several common statistical estimators.

The definitions in this chapter are general. Subsequent chapters will treat specific forms. For other
treatments of this and related material, see Sorenson (1980), Schweppe (1973), Goodwin and Payne (1977), and
Eykhoff (1974). These books also cover other estimators that we do not mention here.

4.1  DEFINITION  OF  AN  ESTIMATOR 

The concept of estimation is central to our study. The statistical definition of an estimator is as
follows:

Perform an experiment (input) $U$, taken from the set $\mathcal{U}$ of possible experiments on the system. The system
response is a random variable:

$$Z = Z(\xi, U, \omega) \tag{4.1-1}$$

where $\xi \in \Xi$ is the true value of the parameter vector and $\omega \in \Omega$ is the random component of the system.

An estimator is any function of $Z$ with range in $\Xi$. The value of the function is called the estimate
$\hat{\xi}$. Thus

$$\hat{\xi} = \hat{\xi}(Z, U) = \hat{\xi}(Z(\xi, U, \omega), U) \tag{4.1-2}$$

This definition is readily generalized to multiple performances of the same experiment or to the performance
of more than one experiment. If $N$ experiments $U_i$ are performed, with responses $Z_i$, then an estimate
would be of the form

$$\hat{\xi} = \hat{\xi}(Z_1, \ldots, Z_N, U_1, \ldots, U_N)
= \hat{\xi}(Z(\xi, U_1, \omega_1), \ldots, Z(\xi, U_N, \omega_N), U_1, \ldots, U_N) \tag{4.1-3}$$

where the $\omega_i$ are independent. The $N$ experiments can be regarded as a single "super-experiment"
$(U_1, \ldots, U_N) \in \mathcal{U} \times \mathcal{U} \times \cdots \times \mathcal{U}$, the response to which is the concatenated vector
$(Z_1, \ldots, Z_N) \in \mathcal{Z} \times \mathcal{Z} \times \cdots \times \mathcal{Z}$. The random element is
$(\omega_1, \ldots, \omega_N) \in \Omega \times \Omega \times \cdots \times \Omega$. Equation (4.1-3) is then simply a restatement of
Equation (4.1-2) on the larger space.

For simplicity of notation, we will generally omit the dependence on $U$ from Equations (4.1-1) and
(4.1-2). For the most part, we will be discussing parameter estimation based on responses to specific, known
inputs; therefore, the dependence of the response and the estimate on the input is irrelevant and merely
clutters up the notation. Formally, all of the distributions and expectations may be considered to be
implicitly conditioned on $U$.

Note that the estimate $\hat{\xi}$ is a random variable because it is a function of $Z$, which is a random
variable. When the experiment is actually performed, specific realizations of these random variables will be
obtained. The true parameter value $\xi$ is not usually considered to be random, simply unknown.

In some situations, however, it is convenient to define $\xi$ as a random variable instead of as an unknown
parameter. The significant difference between these approaches is that a random variable has a probability
distribution, which constitutes additional information that can be used in the random-variable approach.
Several popular estimators can only be defined using the random-variable approach. These advantages of the
random-variable approach are balanced by the necessity to know the probability distribution of $\xi$. If this
distribution is not known, there are no differences, except in terminology, between the random-variable and
unknown-parameter approaches.

A third view of $\xi$ involves ideas from information theory. In this context, $\xi$ is considered to be an
unknown parameter as above. Even though $\xi$ is not random, it is defined to have a "probability distribution."
This probability distribution does not relate to any randomness of $\xi$, but reflects our knowledge or
information about the value of $\xi$. Distributions with low variance correspond to a high degree of certainty about
the value of $\xi$, and vice versa. The term "probability distribution" is a misnomer in this context. The terms
"information distribution" or "information function" more accurately reflect this interpretation.

In the context of information theory, the marginal or prior distribution $p(\xi)$ reflects the information
about $\xi$ prior to performing the experiment. A case where there is no prior information can be handled as a
limit of prior distributions with less and less information (variance going to infinity). The distribution of
the response $Z$ is a function of the value of $\xi$. When $\xi$ is a random variable, this is called $p(Z|\xi)$, the
conditional distribution of $Z$ given $\xi$. We will use the same notation when $\xi$ is not random in order to
emphasize the dependence of the distribution on $\xi$, and for consistency of notation. When $p(\xi)$ is defined,
the joint probability density is then

$$p(Z, \xi) = p(Z|\xi)\,p(\xi) \tag{4.1-4}$$

The marginal probability density of $Z$ is

$$p(Z) = \int p(Z, \xi)\, d|\xi| \tag{4.1-5}$$

The conditional density of $\xi$ given $Z$ (also called the posterior density) is

$$p(\xi|Z) = \frac{p(Z, \xi)}{p(Z)} = \frac{p(Z|\xi)\,p(\xi)}{p(Z)} \tag{4.1-6}$$

In the information theory context, the posterior distribution reflects information about the value of $\xi$ after
the experiment is performed. It accounts for the information known prior to the experiment, and the
information gained by the experiment.
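The relations among the joint, marginal, and posterior densities are easy to exercise when the parameter takes only a few values. The sketch below is our own illustration, not part of the original text: the parameter grid, the prior, and the likelihood values are made-up numbers, and sums over the grid stand in for the integrals.

```python
def posterior(prior, likelihood):
    """Return the posterior p(xi|Z) on a finite grid, plus the marginal p(Z).

    prior[k]      is p(xi_k)
    likelihood[k] is p(Z|xi_k) for the one response Z that was observed
    """
    joint = [l * p for l, p in zip(likelihood, prior)]   # Equation (4.1-4)
    marginal = sum(joint)                                 # Equation (4.1-5), sum in place of integral
    post = [j / marginal for j in joint]                  # Equation (4.1-6)
    return post, marginal

# Hypothetical three-point parameter space and made-up densities.
xi_grid = [0.0, 1.0, 2.0]
prior = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.6, 0.3]

post, p_z = posterior(prior, likelihood)

if __name__ == "__main__":
    for xi, before, after in zip(xi_grid, prior, post):
        print(xi, before, after)
```

The value of the parameter that explains the observed response best gains weight in the posterior, while the posterior still sums to one.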


The distinctions among the random variable, unknown parameter, and information theory points of view are
largely academic. Although the conventional notations differ, the equations used are equivalent in all three
cases. Our presentation uses the probability density notation throughout. We see little benefit in repeating
identical derivations, substituting the term "information function" for "likelihood function" and changing
notation. We derive the basic equations only once, restricting the distinctions among the three points of view
to discussions of applicability and interpretation.


4.2  PROPERTIES  OF  ESTIMATORS 

We can define an infinite number of estimators for a given problem. The definition of an estimator
provides no means of evaluating these estimators, some of which can be ridiculously poor. This section will
describe some of the properties used to evaluate estimators and to select a good estimator for a particular
problem. The properties are all expressed in terms of optimality criteria.

4.2.1  Unbiased  Estimators 


A bias is a consistent or repeatable error. The parameter estimates from any specific data set will
always be imperfect. It is reasonable to hope, however, that the estimate obtained from a large set of
maneuvers would be centered around the true value. The errors in the estimates might be thought of as
consisting of two components: consistent errors and random errors. Random errors are generally unavoidable.
Consistent or average errors might be removable.

Let us restate the above ideas more precisely. The bias $b$ of an estimator $\hat{\xi}(\cdot)$ is defined as

$$b(\xi) = E\{\hat{\xi}\,|\,\xi\} - \xi = E\{\hat{\xi}(Z(\xi, \omega))\,|\,\xi\} - \xi \tag{4.2-1}$$

The $Z$ in these equations is a random variable, not a specific realization. Note that the bias is a function
of the true value. It averages out (by the $E\{\cdot\}$) the random noise effects, but there is no averaging among
the different true values. The bias is also a function of the input $U$, but this dependence is not usually
made explicit. All discussions of bias are implicitly referring to some given input.

An unbiased estimator is defined as an estimator for which the bias is identically zero:

$$b(\xi) = 0 \tag{4.2-2}$$

This requirement is quite stringent because it must be met for every value of $\xi$. Unbiased estimators may not
exist for some problems. For other problems, unbiased estimators may exist, but may be too complicated for
practical computation. Any estimator that is not unbiased is called biased.

Generally, it is considered desirable for an estimator to be unbiased. This judgment, however, does not
apply to all situations. The bias of an estimator measures only the average of its behavior. It is possible
for the individual estimates to be so poor that they are ludicrous, yet average out so that the estimator is
unbiased. The following example is taken from Ferguson (1967, p. 136).

Example 4.2-1 A telephone operator has been working for 10 minutes and
wonders if he would be missed if he took a 20 minute coffee break. Assume that
calls are coming in as a Poisson process with the average rate of $\lambda$ calls
per 10 minutes, $\lambda$ being unknown. The number $Z$ of calls received in the
first 10 minutes has a Poisson distribution with parameter $\lambda$.

$$P(Z|\lambda) = \frac{e^{-\lambda}\lambda^{Z}}{Z!} \qquad Z = 0, 1, \ldots$$

On the basis of $Z$, the operator desires to estimate $\beta$, the probability of
receiving no calls in the next 20 minutes. For a Poisson process, $\beta = e^{-2\lambda}$.
If the estimator $\hat{\beta}(Z)$ is to be unbiased, we must have

$$E\{\hat{\beta}(Z(\beta, \omega))\,|\,\beta\} = \beta \quad \text{for all } \beta \in (0, 1]$$

Thus

$$\sum_{Z=0}^{\infty} \hat{\beta}(Z)\,\frac{e^{-\lambda}\lambda^{Z}}{Z!} = e^{-2\lambda} \quad \text{for all } \lambda \in [0, \infty)$$

Multiply by $e^{\lambda}$, giving

$$\sum_{Z=0}^{\infty} \hat{\beta}(Z)\,\frac{\lambda^{Z}}{Z!} = e^{-\lambda}$$

Expand the right-hand side as a power series to get

$$\sum_{Z=0}^{\infty} \hat{\beta}(Z)\,\frac{\lambda^{Z}}{Z!} = \sum_{Z=0}^{\infty} (-1)^{Z}\,\frac{\lambda^{Z}}{Z!}$$

The convergent power series are equal for all $\lambda \in [0, \infty)$ if the coefficients
are identical. Thus $\hat{\beta}(Z) = (-1)^{Z}$ is the only unbiased estimator of $\beta$ for
this problem. The operator would estimate the probability of missing no
calls as +1 if he had received an even number of calls and -1 if he had
received an odd number of calls. This estimator is the only unbiased estimator
for the problem, but it is a ridiculously poor one. If the estimates are
required to lie in the meaningful range of [0,1], then there is no unbiased
estimator, but some quite reasonable biased estimators can be easily constructed.
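The claims of this example can be checked by summing the Poisson series directly. The sketch below is our own illustration under stated assumptions: the "plug-in" estimator exp(-2Z) is a hypothetical biased alternative that we chose for comparison (it is not proposed in the text), and the series is truncated at 80 terms, which is ample for the small rates used here.

```python
import math

def poisson_pmf(z, lam):
    """Probability of z calls when the mean number of calls is lam."""
    return math.exp(-lam) * lam ** z / math.factorial(z)

def expectation(f, lam, terms=80):
    """E{f(Z)} for Poisson Z with parameter lam, truncating the series."""
    return sum(f(z) * poisson_pmf(z, lam) for z in range(terms))

def unbiased(z):
    """The only unbiased estimator of exp(-2*lam): +1 for even z, -1 for odd z."""
    return (-1.0) ** z

def plug_in(z):
    """Hypothetical biased alternative: substitute z for lam in exp(-2*lam)."""
    return math.exp(-2.0 * z)

def bias_and_mse(estimator, lam):
    """Bias and mean square error of an estimator of beta = exp(-2*lam)."""
    beta = math.exp(-2.0 * lam)
    bias = expectation(estimator, lam) - beta
    mse = expectation(lambda z: (estimator(z) - beta) ** 2, lam)
    return bias, mse

if __name__ == "__main__":
    for lam in (0.5, 1.0, 2.0):
        print(lam, bias_and_mse(unbiased, lam), bias_and_mse(plug_in, lam))
```

For each rate tried, the alternating estimator has zero bias to rounding error but a mean square error near one, while the biased plug-in estimator stays inside [0,1] and has a much smaller mean square error.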

The bias is a useful tool for studying estimators. In general, it is desirable for the bias to be zero,
or at least small. However, because the bias measures only the average properties of the estimates, it cannot
be used as the sole criterion for evaluating estimators. It is possible for a biased estimator to be clearly
superior to all of the unbiased estimators for a problem.

4.2.2  Minimum  Variance  Estimators 

The variance of an estimator is defined as

$$\mathrm{var}(\hat{\xi}) = E\{(\hat{\xi} - E\{\hat{\xi}|\xi\})(\hat{\xi} - E\{\hat{\xi}|\xi\})^*\,|\,\xi\} \tag{4.2-3}$$

Note that the variance, like the bias, is a function of the input and the true value. The variance alone is
not a reasonable measure for evaluating an estimator. For instance, any constant estimator (one that always
returns a constant value, ignoring the data) has zero variance. These are obviously poor estimators in most
situations.

A more useful measure is the mean square error:

$$\mathrm{mse}(\hat{\xi}) = E\{(\hat{\xi} - \xi)(\hat{\xi} - \xi)^*\,|\,\xi\} \tag{4.2-4}$$

The mean square error and variance are obviously identical for unbiased estimators ($E\{\hat{\xi}|\xi\} = \xi$). An estimator
is uniformly minimum mean-square error if, for every value of $\xi$, its mean square error is less than or equal
to the mean square error of any other estimator. Note that the mean-square error is a symmetric matrix. One
symmetric matrix is less than or equal to another if their difference is positive semi-definite. This
definition is somewhat academic at this point because such estimators do not exist except in trivial cases. A
constant estimator has zero mean-square error when $\xi$ is equal to the constant. (The performance is poor at
other values of $\xi$.) Therefore, in order to be uniformly minimum mean-square error, an estimator would have
to have zero mean-square error for every $\xi$; otherwise, a constant estimator would be better for that $\xi$.

The concept of minimum mean-square error becomes more useful if the class of estimators allowed is
restricted. An estimator is uniformly minimum mean-square error unbiased if it is unbiased and, for every
value of $\xi$, its mean-square error is less than or equal to that of any other unbiased estimator. Such
estimators do not exist for every problem, because the requirement must hold for every value of $\xi$. Estimators
optimum in this sense exist for many problems of interest. The mean-square error and the variance are
identical for unbiased estimators, so such optimal estimators are also called uniformly minimum variance unbiased
estimators. They are also often called simply minimum variance estimators. This term should be regarded as
an abbreviation, because it is not meaningful in itself.

4.2.3 Cramér-Rao Inequality (Efficient Estimators)

The Cramér-Rao inequality is one of the central results used to evaluate the performance of estimators.
The inequality gives a theoretical limit to the accuracy that is possible, regardless of the estimator used.
In a sense, the Cramér-Rao inequality gives a measure of the information content of the data.

Before deriving the Cramér-Rao inequality, let us prove a brief lemma.

Lemma 4.2-1 Let $X$ and $Y$ be two random $N$-vectors. Then

$$E\{XX^*\} \ge E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\} \tag{4.2-5}$$

assuming that the inverse exists.

Proof The proof is done by completing the square. Let $A$ be any nonrandom
$N$-by-$N$ matrix. Then

$$E\{(X - AY)(X - AY)^*\} \ge 0 \tag{4.2-6}$$

because it is a covariance matrix. Expanding

$$E\{XX^*\} \ge AE\{YX^*\} + E\{XY^*\}A^* - AE\{YY^*\}A^* \tag{4.2-7}$$

Choose

$$A = E\{XY^*\}[E\{YY^*\}]^{-1} \tag{4.2-8}$$

Then

$$\begin{aligned}
E\{XX^*\} \ge{}& E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\} + E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\} \\
&- E\{XY^*\}[E\{YY^*\}]^{-1}E\{YY^*\}[E\{YY^*\}]^{-1}E\{YX^*\}
\end{aligned} \tag{4.2-9}$$

or

$$E\{XX^*\} \ge E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\} \tag{4.2-5}$$

completing the lemma.

We now seek to find a bound on $E\{(\hat{\xi} - \xi)(\hat{\xi} - \xi)^*|\xi\}$, the mean square error of the estimate.

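Theorem 4.2-2 (Cramér-Rao) Assume that the density $p(Z|\xi)$ exists and is
smooth enough to allow the operations below. (See Cramér (1946) for details.)
This assumption proves adequate for most cases of interest to us. Pitman (1979)
discusses some of the cases where $p(Z|\xi)$ is not as smooth as required here.
Then

$$E\{(\hat{\xi}(Z) - \xi)(\hat{\xi}(Z) - \xi)^*|\xi\} \ge [I + \nabla_\xi b(\xi)]\,M(\xi)^{-1}\,[I + \nabla_\xi b(\xi)]^* \tag{4.2-10}$$

where

$$M(\xi) = E\{(\nabla_\xi^* \ln p(Z|\xi))(\nabla_\xi \ln p(Z|\xi))\,|\,\xi\} \tag{4.2-11}$$

Proof Let $X$ and $Y$ from Lemma (4.2-1) be $\hat{\xi}(Z) - \xi$ and $\nabla_\xi^* \ln p(Z|\xi)$,
respectively, and let all of the expectations in the lemma be conditioned
on $\xi$. Concentrate first on the term

$$\begin{aligned}
E\{XY^*|\xi\} &= E\{(\hat{\xi}(Z) - \xi)(\nabla_\xi \ln p(Z|\xi))\,|\,\xi\} \\
&= \int (\hat{\xi}(Z) - \xi)(\nabla_\xi \ln p(Z|\xi))\,p(Z|\xi)\,d|Z|
\end{aligned} \tag{4.2-12}$$

where $d|Z|$ is the volume element in the space $\mathcal{Z}$. Substituting the
relation

$$\nabla_\xi \ln p(Z|\xi) = \frac{\nabla_\xi\, p(Z|\xi)}{p(Z|\xi)} \tag{4.2-13}$$

gives

$$\begin{aligned}
E\{XY^*|\xi\} &= \int (\hat{\xi}(Z) - \xi)(\nabla_\xi\, p(Z|\xi))\,d|Z| \\
&= \int \hat{\xi}(Z)(\nabla_\xi\, p(Z|\xi))\,d|Z| - \int \xi\,(\nabla_\xi\, p(Z|\xi))\,d|Z|
\end{aligned} \tag{4.2-14}$$

Now $\hat{\xi}(Z)$ is not a function of $\xi$. Therefore, assuming sufficient smoothness
of $p(Z|\xi)$ as a function of $\xi$, the first term becomes

$$\int \hat{\xi}(Z)\,\nabla_\xi\, p(Z|\xi)\,d|Z| = \nabla_\xi \int \hat{\xi}(Z)\,p(Z|\xi)\,d|Z| = \nabla_\xi E\{\hat{\xi}(Z)|\xi\} \tag{4.2-15}$$

Using the definition (Equation (4.2-1)) of the bias, obtain

$$\nabla_\xi E\{\hat{\xi}(Z)|\xi\} = \nabla_\xi[\xi + b(\xi)] = I + \nabla_\xi b(\xi) \tag{4.2-16}$$

In the second term of Equation (4.2-14), $\xi$ is not a function of $Z$, so

$$\int \xi\,\nabla_\xi\, p(Z|\xi)\,d|Z| = \xi\,\nabla_\xi \int p(Z|\xi)\,d|Z| = \xi\,\nabla_\xi 1 = 0 \tag{4.2-17}$$

Using Equations (4.2-16) and (4.2-17) in Equation (4.2-14) gives

$$E\{XY^*|\xi\} = I + \nabla_\xi b(\xi) \tag{4.2-18}$$

Define the Fisher information matrix

$$M(\xi) \equiv E\{YY^*|\xi\} = E\{(\nabla_\xi^* \ln p(Z|\xi))(\nabla_\xi \ln p(Z|\xi))\,|\,\xi\} \tag{4.2-19}$$

Then by Lemma (4.2-1)

$$E\{(\hat{\xi}(Z) - \xi)(\hat{\xi}(Z) - \xi)^*|\xi\} \ge [I + \nabla_\xi b(\xi)]\,M(\xi)^{-1}\,[I + \nabla_\xi b(\xi)]^* \tag{4.2-10}$$

which is the desired result.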


Equation (4.2-10) is the Cramér-Rao inequality. Its specialization to unbiased estimators is of particular
interest. For an unbiased estimator, $b(\xi)$ is zero so

$$E\{(\hat{\xi}(Z) - \xi)(\hat{\xi}(Z) - \xi)^*|\xi\} \ge M(\xi)^{-1} \tag{4.2-20}$$

This gives us a lower bound, as a function of $\xi$, on the achievable variance of any unbiased estimator. An
unbiased estimator which attains the equality in Equation (4.2-20) is called an efficient estimator. No
estimator can achieve a lower variance than an efficient estimator except by introducing a bias in the
estimates. In this sense, an efficient estimator makes the most use of the information available in the data.

The above development gives no guarantee that an efficient estimator exists for every problem. When an
efficient estimator does exist, it is also a uniformly minimum variance unbiased estimator. It is much easier
to check for equality in Equation (4.2-20) than to directly prove that no other unbiased estimator has a
smaller variance than a given estimator. The Cramér-Rao inequality is therefore useful as a sufficient (but
not necessary) check that an estimator is uniformly minimum variance unbiased.


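A useful alternative expression for the information matrix $M$ can be obtained if $p(Z|\xi)$ is sufficiently
smooth. Applying Equation (4.2-13) to the definition of $M$ (Equation (4.2-19)) gives

$$M(\xi) = E\left\{\frac{(\nabla_\xi^*\, p(Z|\xi))(\nabla_\xi\, p(Z|\xi))}{p(Z|\xi)^2}\,\Big|\,\xi\right\} \tag{4.2-21}$$

Then examine

$$E\{\nabla_\xi^2 \ln p(Z|\xi)\,|\,\xi\} = E\left\{\frac{\nabla_\xi^2\, p(Z|\xi)}{p(Z|\xi)}\,\Big|\,\xi\right\}
- E\left\{\frac{(\nabla_\xi^*\, p(Z|\xi))(\nabla_\xi\, p(Z|\xi))}{p(Z|\xi)^2}\,\Big|\,\xi\right\} \tag{4.2-22}$$

The expectation in the second term is equal to $M(\xi)$, as shown in Equation (4.2-21). Evaluate the first term as

$$\int \frac{\nabla_\xi^2\, p(Z|\xi)}{p(Z|\xi)}\,p(Z|\xi)\,d|Z| = \int \nabla_\xi^2\, p(Z|\xi)\,d|Z|
= \nabla_\xi^2 \int p(Z|\xi)\,d|Z| = \nabla_\xi^2 1 = 0 \tag{4.2-23}$$

Thus an alternate expression for the information matrix is

$$M(\xi) = -E\{\nabla_\xi^2 \ln p(Z|\xi)\,|\,\xi\} \tag{4.2-24}$$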
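A scalar Poisson model gives a case where both expressions for the information matrix, and the Cramér-Rao bound itself, can be evaluated by direct summation. The sketch below is our own illustration, not part of the original text: the parameter is the Poisson mean lam, the estimators are of the assumed form a*Z, and the series is truncated at 120 terms. For this model the score is Z/lam - 1 and the second derivative of the log density is -Z/lam**2.

```python
import math

def poisson_pmf(z, lam):
    return math.exp(-lam) * lam ** z / math.factorial(z)

def expect(f, lam, terms=120):
    """E{f(Z)} for Poisson Z with parameter lam, truncating the series."""
    return sum(f(z) * poisson_pmf(z, lam) for z in range(terms))

def info_from_score(lam):
    """Scalar form of Equation (4.2-19): mean square of the score."""
    return expect(lambda z: (z / lam - 1.0) ** 2, lam)

def info_from_curvature(lam):
    """Scalar form of Equation (4.2-24): minus the mean second derivative."""
    return -expect(lambda z: -z / lam ** 2, lam)

def mse_and_bound(a, lam):
    """Mean square error of the estimator a*Z and its bound from Equation (4.2-10)."""
    mse = expect(lambda z: (a * z - lam) ** 2, lam)
    bias_gradient = a - 1.0            # the bias of a*Z is (a - 1)*lam
    bound = (1.0 + bias_gradient) ** 2 / info_from_score(lam)
    return mse, bound

if __name__ == "__main__":
    lam = 2.5
    print(info_from_score(lam), info_from_curvature(lam))
    print(mse_and_bound(1.0, lam))     # unbiased estimator Z
    print(mse_and_bound(0.8, lam))     # biased, shrunk estimator
```

With a = 1 the estimator is unbiased and its mean square error equals the bound, so it is efficient. With a = 0.8 the mean square error falls below the unbiased bound of Equation (4.2-20) while still respecting the biased bound of Equation (4.2-10), which shows how a lower error is bought by introducing a bias.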


4.2.4 Bayesian Optimal Estimators

The optimality conditions of the previous sections have been quite restrictive in that they must hold
simultaneously for every possible value of $\xi$. Thus for some problems, no estimators exist that are optimal
by these criteria. The Bayesian approach avoids this difficulty by using a single, overall, optimality
criterion which averages the errors made for different values of $\xi$. With this approach, an optimal estimator
may be worse than a nonoptimal one for specific values of $\xi$, but the overall averaged performance of the
Bayesian optimal estimator will be better.

The Bayesian approach requires that a loss function (risk function, optimality criterion) be defined as a
function of the true value $\xi$ and the estimate $\hat{\xi}$. The most common loss function is a weighted square error

$$J(\xi, \hat{\xi}) = (\xi - \hat{\xi})^* R\,(\xi - \hat{\xi}) \tag{4.2-25}$$

where $R$ is a weighting matrix. An estimator is considered optimal in the Bayesian sense if it minimizes the
a posteriori expected value of the loss function:

$$E\{J(\xi, \hat{\xi}(Z))\,|\,Z\} = \int J(\xi, \hat{\xi}(Z))\,p(\xi|Z)\,d|\xi|
= \frac{\int J(\xi, \hat{\xi}(Z))\,p(Z|\xi)\,p(\xi)\,d|\xi|}{p(Z)} \tag{4.2-26}$$

An optimal estimator must minimize this expected value for each $Z$. Since $p(Z)$ is not a function of $\hat{\xi}$, it
does not affect the minimization of Equation (4.2-26) with respect to $\hat{\xi}$. Thus a Bayesian optimal estimator
also minimizes the expression

$$\int J(\xi, \hat{\xi}(Z))\,p(Z|\xi)\,p(\xi)\,d|\xi| \tag{4.2-27}$$

Note that $p(\xi)$, the probability density of $\xi$, is required in order to define Bayesian optimality. For this
purpose, $p(\xi)$ can be considered simply as a weighting that is part of the loss function, if it cannot
appropriately be interpreted as a true probability density or an information function (Section 4.1).

4.2.5  Asymptotic  Properties 

Asymptotic properties concern the characteristics of the estimates as the amount of data used increases
toward infinity. The amount of data used can increase either by repeating experiments or by increasing the
time slice analyzed in a single experiment. (The latter is pertinent only for dynamic systems.) Since only
a finite amount of data can be used in practice, it is not immediately obvious why there is any interest in
asymptotic properties.

This interest arises primarily from considerations of simplicity. It is often simpler to compute
asymptotic properties and to construct asymptotically optimal estimators than to do so for finite amounts of data.
We can then use the asymptotic results as good approximations to the more difficult finite data results if the
amount of data used is large enough. The finite data definitions of unbiased estimators and efficient
estimators have direct asymptotic analogues of interest. An estimator is asymptotically unbiased if the bias goes
to zero for all $\xi$ as the amount of data goes to infinity. An estimator is asymptotically efficient if it is
asymptotically unbiased and if

$$M(\xi)\,E\{(\hat{\xi} - \xi)(\hat{\xi} - \xi)^*|\xi\} \rightarrow I \tag{4.2-28}$$

as the amount of data approaches infinity. Equation (4.2-28) is an asymptotic expression for equality in
Equation (4.2-20).
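A small exact computation can make asymptotic unbiasedness concrete. The sketch below is our own illustration and is not drawn from the text: we assume N independent samples that are 1 with probability p and 0 otherwise, and estimate the variance p(1 - p) by the average squared deviation from the sample mean. Summing over the binomial distribution of the number of ones gives the expected value of the estimate without any simulation.

```python
import math

def expected_sample_variance(n, p):
    """Exact mean of (1/n) * sum((x_i - xbar)**2) for n independent 0/1 samples."""
    total = 0.0
    for k in range(n + 1):                       # k is the number of ones in the sample
        prob = math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        xbar = k / n
        # For 0/1 data the average squared deviation reduces to xbar * (1 - xbar).
        total += prob * xbar * (1.0 - xbar)
    return total

def bias(n, p):
    """Bias of the estimator as a function of the true parameter p."""
    return expected_sample_variance(n, p) - p * (1.0 - p)

if __name__ == "__main__":
    for n in (5, 50, 400):
        print(n, bias(n, 0.3))
```

The bias is nonzero for every finite N, here equal to -p(1 - p)/N, but goes to zero for every p as N increases, so under these assumptions the estimator is biased yet asymptotically unbiased.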

One important asymptotic property has no finite data analogue. This is the notion of consistency. An
estimator is consistent if $\hat{\xi} \rightarrow \xi$ as the amount of data goes to infinity. For strong consistency, the
convergence is required to be with probability one. Note that strong consistency is defined in terms of the
convergence of individual realizations of the estimates, unlike the bias, variance, and other properties which
are defined in terms of average properties (expected values).

Consistency is a stronger property than asymptotic unbiasedness; that is, all consistent estimators are
asymptotically unbiased. This is a basic convergence result: convergence with probability one implies
convergence in distribution (and thus, specifically, convergence in mean). We refer the reader to Lipster and
Shiryayev (1977), Cramér (1946), Goodwin and Payne (1977), Zacks (1971), and Mehra and Lainiotis (1976) for
this and other results on consistency. Results on consistency tend to involve careful mathematical arguments
relating to different types of convergence.

We will not delve deeply into asymptotic properties such as consistency in this book. We generally feel
that asymptotic properties, although theoretically intriguing, should be played down in practical application.
Application of infinite-time results to finite data is an approximation, one that is sometimes useful, but
sometimes gives completely misleading conclusions (see Section 8.2). The inconsistency should be evident in
books that spend copious time arguing fine points of distinction between different kinds of convergence and
then pass off application to finite data with cursory allusions to using large data samples.

Although we de-emphasize the "rigorous" treatment of asymptotic properties, some asymptotic results are
crucial to practical implementation. This is not because of any improved rigor of the asymptotic results, but
because the asymptotic results are often simpler, sometimes enough simpler to make the critical difference in
usability. This is our primary use of asymptotic results: as simplifying approximations to the finite-time
results. Introduction of complicated convergence arguments hides this essential role. The approximations work
well in many cases and, as with most approximations, fail in some situations. Our emphasis in asymptotic
results will center on justifying when they are appropriate and understanding when they fail.


4.3  COMMON  ESTIMATORS 

This section will define some of the commonly used general types of estimators. The list is far from
complete; we mention only those estimators that will be used in this book. We also present a few general
results characterizing the estimators.

4.3.1 A posteriori Expected Value

One of the most natural estimates is the a posteriori expected value. This estimate is defined as the
mean of the posterior distribution.

$$\hat{\xi}(Z) = E\{\xi|Z\} = \int \xi\,p(\xi|Z)\,d|\xi|
= \frac{\int \xi\,p(Z|\xi)\,p(\xi)\,d|\xi|}{\int p(Z|\xi)\,p(\xi)\,d|\xi|} \tag{4.3-1}$$

This estimator requires that $p(\xi)$, the prior density of $\xi$, be known.

4.3.2  Bayesian  Minimum  Risk 

Bayesian optimality was defined in Section 4.2.4. Any estimator which minimizes the a posteriori
expected value of the loss function is a Bayesian minimum risk estimator. (In general, there can be more than
one such estimator for a given problem.) The prior distribution of $\xi$ must be known to define Bayesian
estimators.


Theorem 4.3-1 The a posteriori expected value (Section 4.3.1) is the unique
Bayesian minimum risk estimator for the loss function

$$J(\xi, \hat{\xi}) = (\xi - \hat{\xi})^* R\,(\xi - \hat{\xi}) \tag{4.3-2}$$

where $R$ is any positive definite symmetric matrix.

Proof A Bayesian minimum risk estimator must minimize

$$E\{J|Z\} = E\{(\xi - \hat{\xi}(Z))^* R\,(\xi - \hat{\xi}(Z))\,|\,Z\} \tag{4.3-3}$$

Since $R$ is symmetric, the gradient of this function is

$$\nabla_{\hat{\xi}} E\{J|Z\} = -2E\{R(\xi - \hat{\xi}(Z))\,|\,Z\}^* \tag{4.3-4}$$

Setting this expression to zero gives

$$0 = R\,E\{\xi - \hat{\xi}(Z)\,|\,Z\} = R\,[E\{\xi|Z\} - \hat{\xi}(Z)] \tag{4.3-5}$$

Therefore

$$\hat{\xi}(Z) = E\{\xi|Z\} \tag{4.3-6}$$

is the unique stationary point of $E\{J|Z\}$. The second gradient is

$$\nabla_{\hat{\xi}}^2 E\{J|Z\} = 2R > 0 \tag{4.3-7}$$

so the stationary point is the global minimum.
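Theorem (4.3-1) can be checked by brute force in the scalar case. The sketch below is our own illustration, not part of the original text: the four-point posterior, the scalar weighting r, and the search grid of candidate estimates are all made-up values, and sums over the grid stand in for the integrals. The posterior is deliberately asymmetric to emphasize that the theorem needs no symmetry assumption.

```python
def posterior_mean(xi_grid, post):
    """A posteriori expected value on a finite grid."""
    return sum(p * x for x, p in zip(xi_grid, post))

def expected_loss(estimate, xi_grid, post, loss):
    """A posteriori expected loss E{J(xi, estimate) | Z} on a finite grid."""
    return sum(p * loss(x - estimate) for x, p in zip(xi_grid, post))

# Hypothetical posterior p(xi|Z); it is not symmetric about its mean.
xi_grid = [0.0, 1.0, 2.0, 5.0]
post = [0.1, 0.4, 0.3, 0.2]

r = 3.0                                # scalar stand-in for the positive definite R

def quadratic(e):
    """Scalar form of the loss in Equation (4.3-2)."""
    return r * e * e

# Search candidate estimates 0.00, 0.01, ..., 5.00 for the smallest expected loss.
candidates = [k / 100.0 for k in range(501)]
best = min(candidates, key=lambda c: expected_loss(c, xi_grid, post, quadratic))

if __name__ == "__main__":
    print(best, posterior_mean(xi_grid, post))
```

The candidate with the smallest expected loss coincides with the posterior mean, and changing r rescales the loss without moving the minimum.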

Theorem (4.3-1) applies only for the quadratic loss function of Equation (4.3-2). The following very
similar theorem applies to a much broader class of loss functions, but requires the assumption that $p(\xi|Z)$ is
symmetric about its mean. Theorem (4.3-1) makes no assumptions about $p(\xi|Z)$ except that it has finite mean
and variance.


Theorem 4.3-2 Assume that $p(\xi|Z)$ is symmetric about its mean for each $Z$; i.e.,

$$p_{\xi|Z}(\hat{\xi}(Z) + \xi\,|\,Z) = p_{\xi|Z}(\hat{\xi}(Z) - \xi\,|\,Z) \tag{4.3-8}$$

where $\hat{\xi}(Z)$ is the expected value of $\xi$ given $Z$. Then the a posteriori expected value is the
unique Bayesian minimum risk estimator for any loss function of the form

$$J(\xi, \hat{\xi}) = J_1(\xi - \hat{\xi}) \tag{4.3-9}$$

where $J_1$ is symmetric about 0 and is strictly convex.

Proof We need to demonstrate that

$$D(a) \equiv E\{J(\xi, \hat{\xi}(Z) + a)\,|\,Z\} - E\{J(\xi, \hat{\xi}(Z))\,|\,Z\} > 0 \tag{4.3-10}$$

for all $a \ne 0$. Using Equation (4.3-9) and the definition of expectation

$$D(a) = \int p(\xi|Z)\,[J_1(\xi - \hat{\xi}(Z) - a) - J_1(\xi - \hat{\xi}(Z))]\,d|\xi| \tag{4.3-11}$$

Because of the symmetry of $p(\xi|Z)$, we can replace the integral in Equation (4.3-11)
by an integral over the region

$$S = \{\xi : (\xi - \hat{\xi}(Z), a) > 0\} \tag{4.3-12}$$

giving

$$\begin{aligned}
D(a) = \int_S p(\xi|Z)\,[&J_1(\xi - \hat{\xi}(Z) - a) + J_1(\hat{\xi}(Z) - \xi - a) \\
&- J_1(\xi - \hat{\xi}(Z)) - J_1(\hat{\xi}(Z) - \xi)]\,d|\xi|
\end{aligned} \tag{4.3-13}$$

Using the symmetry of $J_1$ gives

$$D(a) = \int_S p(\xi|Z)\,[J_1(\xi - \hat{\xi}(Z) - a) + J_1(\xi - \hat{\xi}(Z) + a) - 2J_1(\xi - \hat{\xi}(Z))]\,d|\xi| \tag{4.3-14}$$

By the strict convexity of $J_1$

$$J_1(\xi - \hat{\xi}(Z) - a) + J_1(\xi - \hat{\xi}(Z) + a) > 2J_1(\xi - \hat{\xi}(Z)) \tag{4.3-15}$$

for all $a \ne 0$. Therefore $D(a) > 0$ for all $a \ne 0$ as we desired to show.

Note that if $J_1$ is convex, but not strictly convex, Theorem (4.3-2) still holds except for the
uniqueness. Theorems (4.3-1) and (4.3-2) are two of the basic results in the theory of estimation. They motivate
the use of a posteriori expected value estimators.
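The role of the symmetry assumption in Theorem (4.3-2) can be seen in a small scalar experiment. The sketch below is our own illustration, not part of the original text: the quartic loss, the two discrete posteriors, and the search grid are made-up choices, with sums over the grid in place of integrals. The quartic is symmetric about 0 and strictly convex, so it meets the conditions on the loss; only one of the two posteriors is symmetric about its mean.

```python
def expected_loss(estimate, xi_grid, post, loss):
    """A posteriori expected loss E{J_1(xi - estimate) | Z} on a finite grid."""
    return sum(p * loss(x - estimate) for x, p in zip(xi_grid, post))

def minimum_risk(xi_grid, post, loss, lo=-1.0, hi=5.0, steps=600):
    """Brute-force search for the estimate with the smallest expected loss."""
    candidates = [lo + k * (hi - lo) / steps for k in range(steps + 1)]
    return min(candidates, key=lambda c: expected_loss(c, xi_grid, post, loss))

def posterior_mean(xi_grid, post):
    return sum(p * x for x, p in zip(xi_grid, post))

def quartic(e):
    """A loss that is symmetric about 0 and strictly convex."""
    return e ** 4

# Hypothetical posterior that is symmetric about its mean of 1.
sym_grid, sym_post = [-1.0, 0.0, 1.0, 2.0, 3.0], [0.1, 0.2, 0.4, 0.2, 0.1]
# Hypothetical posterior that is not symmetric (mean of 2).
skew_grid, skew_post = [0.0, 1.0, 2.0, 5.0], [0.1, 0.4, 0.3, 0.2]

sym_best = minimum_risk(sym_grid, sym_post, quartic)
skew_best = minimum_risk(skew_grid, skew_post, quartic)

if __name__ == "__main__":
    print(sym_best, posterior_mean(sym_grid, sym_post))
    print(skew_best, posterior_mean(skew_grid, skew_post))
```

For the symmetric posterior the minimum risk estimate falls on the posterior mean. For the asymmetric posterior it is pulled toward the long tail and no longer equals the mean, so the conclusion of the theorem cannot be expected without the symmetry assumption.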


4.3.3  Maximum  a  posteriori  Probability 

The maximum a posteriori probability (MAP) estimate is defined as the mode of the posterior distribution
(i.e., the value of $\xi$ which maximizes the posterior density function). If the distribution is not unimodal,
the MAP estimate may not be unique. As with the previously discussed estimators, the prior distribution of
$\xi$ must be known in order to define the MAP estimate.


The  MAP  estimate  Is  equal  to  the  a  posteriori  expected  value  (and  thus  to  the  Bayesian  minimum  risk  for 
loss  functions  meeting  the  conditions  of  Theorem  (4.3-2))  If  the  posterior  distribution  Is  symmetric  about  Its 
mean  and  unlmodal,  since  the  mode  and  the  mean  of  such  distributions  are  equal.  For  nonsymmetrlc  distribu¬ 
tions,  this  equality  does  not  hold. 

The MAP estimate is generally much easier to calculate than the a posteriori expected value. The
a posteriori expected value is (from Equation (4.3-1))

$$ \hat{\xi}(Z) = \frac{\int \xi \, p(Z|\xi)\, p(\xi)\, d|\xi|}{\int p(Z|\xi)\, p(\xi)\, d|\xi|} \tag{4.3-16} $$

This calculation requires the evaluation of two integrals over $\Xi$. The MAP estimate requires the
maximization of

$$ p(\xi|Z) = \frac{p(Z|\xi)\, p(\xi)}{p(Z)} \tag{4.3-17} $$

with respect to $\xi$. The $p(Z)$ is not a function of $\xi$, so the MAP estimate can also be obtained by

$$ \hat{\xi}(Z) = \arg\max_{\xi}\; p(Z|\xi)\, p(\xi) \tag{4.3-18} $$

The "arg max" notation indicates that $\hat{\xi}$ is the value of $\xi$ that maximizes the density
function $p(Z|\xi)p(\xi)$. The maximization in Equation (4.3-18) is generally much simpler than the
integrations in Equation (4.3-16).
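
The contrast between Equations (4.3-16) and (4.3-18) can be illustrated numerically. The following is only a sketch under assumptions that are not in the text: a hypothetical scalar problem with an exponential prior on $\xi \ge 0$ and a measurement $Z = \xi + \omega$ with unit-variance Gaussian noise, evaluated on a grid. The posterior is then nonsymmetric, so the mode and the mean differ.

```python
import math

def posterior_summaries(z, sigma=1.0, rate=1.0, upper=20.0, n=20001):
    """Grid evaluation of the MAP estimate and the a posteriori expected value.

    Hypothetical model: prior p(xi) = rate*exp(-rate*xi) for xi >= 0,
    likelihood p(Z|xi) Gaussian with mean xi and standard deviation sigma.
    """
    step = upper / (n - 1)
    grid = [i * step for i in range(n)]
    # Unnormalized posterior p(Z|xi) p(xi); p(Z) is not needed, as in Equation (4.3-18).
    weights = [math.exp(-0.5 * ((z - x) / sigma) ** 2 - rate * x) for x in grid]
    # MAP estimate: a single maximization over the grid.
    xi_map = grid[max(range(n), key=weights.__getitem__)]
    # Expected value: ratio of two integrals, Equation (4.3-16), as Riemann sums
    # (the common step size cancels in the ratio).
    xi_mean = sum(x * w for x, w in zip(grid, weights)) / sum(weights)
    return xi_map, xi_mean

xi_map, xi_mean = posterior_summaries(z=2.0)
print("MAP estimate:", xi_map)
print("a posteriori expected value:", xi_mean)
```

For this assumed model the posterior is a Gaussian truncated at zero, so its mean lies above its mode; with a symmetric unimodal posterior the two printed values would coincide.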

4.3.4  Maximum  likelihood 

The previous estimators have all required that the prior distribution of $\xi$ be known. When $\xi$
is not random or when its distribution is not known, there are far fewer reasonable estimators to
choose from. Maximum likelihood estimators are the only type that we will discuss.

The maximum likelihood estimate is defined as the value of $\xi$ which maximizes the likelihood
functional $p(Z|\xi)$; in other words,

$$ \hat{\xi}(Z) = \arg\max_{\xi}\; p(Z|\xi) \tag{4.3-19} $$

The maximum likelihood estimator is closely related to the MAP estimator. The MAP estimator
maximizes $p(\xi|Z)$; heuristically we could say that the MAP estimator selects the most probable
value of $\xi$, given the data. The maximum likelihood estimator maximizes $p(Z|\xi)$; i.e., it
selects the value of $\xi$ which makes the observed data most plausible. Although these may sound
like two statements of the same concept, there are crucial differences. One of the most central
differences is that maximum likelihood is defined whether or not the prior distribution of $\xi$ is
known.

Comparing Equation (4.3-18) with Equation (4.3-19) reveals that the maximum likelihood estimate is
identical to the MAP estimate if $p(\xi)$ is a constant. If the parameter space $\Xi$ has finite
size, this implies that $p(\xi)$ is the uniform distribution. For infinite $\Xi$, such as $R^n$,
there are no uniform distributions, so a strict equivalence cannot be established. If we relax our
definition of a probability distribution to allow arbitrary density functions which need not
integrate to 1 (sometimes called generalized probabilities), the equivalence can be established for
any $\Xi$. Alternatively, the uniform distribution for infinite size $\Xi$ can be viewed as a
limiting case of distributions with variance going to infinity (less and less prior certainty about
the value of $\xi$).

The maximum likelihood estimator places no preference on any value of $\xi$ over any other value of
$\xi$; the estimate is solely a function of the data. The MAP estimate, on the other hand, considers
both the data and the preference defined by the prior distribution.

Maximum likelihood estimators have many interesting properties, which we will cover later. One of
the most basic is given by the following theorem:

Theorem 4.3-3 If an efficient estimator exists for a problem, that estimator
is a maximum likelihood estimator.

Proof (This proof requires the use of the full notation for probability
density functions to avoid confusion.) Assume that $\hat{\xi}(Z)$ is any efficient
estimator. An estimator will be efficient if and only if equality holds
in Lemma (4.2-1). Equality holds if and only if $X = AY$ in Equation (4.2-6).
Substituting for $A$ from Equation (4.2-8) gives

$$ X = E\{XY^*\}\, E\{YY^*\}^{-1}\, Y \tag{4.3-20} $$

Substituting for $X$ and $Y$ as in the proof of the Cramer-Rao bound, and using
Equations (4.2-18) and (4.2-19) gives

$$ \hat{\xi}(Z) - \xi = [I + \nabla_\xi b(\xi)]\, M(\xi)^{-1}\, \nabla_\xi^* \ln p_{Z|\xi}(Z|\xi) \tag{4.3-21} $$

Efficient estimators must be unbiased, so $b(\xi)$ is zero and

$$ \hat{\xi}(Z) - \xi = M(\xi)^{-1}\, \nabla_\xi^* \ln p_{Z|\xi}(Z|\xi) \tag{4.3-22} $$

For an efficient estimator, Equation (4.3-22) must hold for all values of $Z$
and $\xi$. In particular, for each $Z$, the equation must hold for $\xi = \hat{\xi}(Z)$.
The left-hand side is then zero, so we must have

$$ \nabla_\xi^* \ln p_{Z|\xi}(Z|\hat{\xi}(Z)) = 0 \tag{4.3-23} $$

The estimate is thus at a stationary point of the likelihood functional.
Taking the gradient of Equation (4.3-22) gives

$$ -I = M(\xi)^{-1}\, \nabla_\xi^2 \ln p_{Z|\xi}(Z|\xi) - M(\xi)^{-1} [\nabla_\xi M(\xi)]\, M(\xi)^{-1}\, \nabla_\xi^* \ln p_{Z|\xi}(Z|\xi) \tag{4.3-24} $$

Evaluating this at $\xi = \hat{\xi}(Z)$, and using Equation (4.3-23) gives

$$ -I = M(\hat{\xi}(Z))^{-1}\, \nabla_\xi^2 \ln p_{Z|\xi}(Z|\hat{\xi}(Z)) \tag{4.3-25} $$

Since $M$ is positive definite, the stationary point is a local maximum.
In fact, it is the only local maximum, because a local maximum at any point
other than $\xi = \hat{\xi}(Z)$ would violate Equation (4.3-22). The requirement for
$\int p_{Z|\xi}(Z|\xi)\, d|Z|$ to be finite implies that $p_{Z|\xi}(Z|\xi) \to 0$ as $|Z| \to \infty$, so
that the local maximum will be a global maximum. Therefore $\hat{\xi}(Z)$ is a
maximum likelihood estimator.

Corollary All efficient estimators for a problem are equivalent (i.e., if
an efficient estimator exists, it is unique).

This theorem and its corollary are not as useful as they might seem at first glance, because
efficient estimators do not exist for many problems. Therefore, it is not always true that a maximum
likelihood estimator is efficient. The theorem does apply to some simple problems, however, and
motivates the more widely applicable asymptotic results which will be discussed later.

Maximum likelihood estimates have the following natural invariance property: let $\hat{\xi}$ be the
maximum likelihood estimate of $\xi$; then $f(\hat{\xi})$ is the maximum likelihood estimate of
$f(\xi)$ for any function $f$. The proof of this statement is trivial if $f$ is invertible. Let
$L_\xi(\xi,Z)$ be the likelihood functional of $\xi$ for a given $Z$. Define

$$ x = f(\xi) \tag{4.3-26} $$

Then the likelihood function of $x$ is

$$ L_x(x,Z) = L_\xi(f^{-1}(x),Z) \tag{4.3-27} $$

This is the crucial equation. By definition, the left-hand side is maximized by $x = \hat{x}$, and
the right-hand side is maximized by $f^{-1}(x) = \hat{\xi}$. Therefore

$$ \hat{x} = f(\hat{\xi}) \tag{4.3-28} $$

The extension to noninvertible $f$ is straightforward: simply realize that $f^{-1}(x)$ is a set of
values, rather than a single value. The same argument then still holds, regarding $L_x(x,Z)$ as a
one-to-many function (set-valued function).
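
The invariance property can be checked numerically. The sketch below assumes a hypothetical example that is not in the text: five made-up samples treated as zero-mean Gaussian data, with $\xi$ the variance and $x = f(\xi) = \sqrt{\xi}$ the standard deviation. Each likelihood is maximized separately by a simple grid search, and the two maximizers are compared.

```python
import math

data = [0.5, -1.2, 0.3, 2.1, -0.7]   # hypothetical zero-mean Gaussian samples

def log_lik_variance(v):
    """ln p(Z|xi), with xi = v the variance of a zero-mean Gaussian."""
    return sum(-0.5 * math.log(2.0 * math.pi * v) - 0.5 * z * z / v for z in data)

def log_lik_sigma(s):
    """Likelihood of x = f(xi) = sqrt(xi): L_x(x,Z) = L_xi(f^-1(x),Z), Equation (4.3-27)."""
    return log_lik_variance(s * s)

def argmax_on_grid(func, lo, hi, n=40001):
    """Return the grid point in [lo, hi] at which func is largest."""
    step = (hi - lo) / (n - 1)
    return max((lo + i * step for i in range(n)), key=func)

var_hat = argmax_on_grid(log_lik_variance, 0.05, 10.0)    # maximize over xi
sigma_hat = argmax_on_grid(log_lik_sigma, 0.05, 10.0)     # maximize over x directly
print("variance estimate:", var_hat)
print("sigma estimate:", sigma_hat, " f(variance estimate):", math.sqrt(var_hat))
```

Up to the grid resolution, maximizing over $x$ directly gives the same answer as applying $f$ to the maximizer over $\xi$, as Equation (4.3-28) states.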

Finally, let us emphasize that, although maximum likelihood estimates are formally identical to MAP
estimates with uniform prior distributions, there is a basic theoretical difference in
interpretation. Maximum likelihood makes no statements about distributions of $\xi$, prior or
posterior. Stating that a parameter has a uniform prior distribution is drastically different from
saying that we have no information about the parameter. Several classic "paradoxes" of probability
theory resulted from ignoring this difference. The paradoxes arise in transformations of variable.
Let a scalar $\xi$ have a uniform prior distribution, and let $f$ be any continuous invertible
function. Then, by Equation (3.4-1), $x = f(\xi)$ has the density function

$$ p_x(x) = p_\xi(f^{-1}(x)) \left| \frac{d f^{-1}(x)}{dx} \right| \tag{4.3-29} $$

which is not a uniform distribution on $x$ (unless $f$ is linear). Thus if we say that there is no
prior information (uniform distribution) about $\xi$, then this gives us prior information
(nonuniform distribution) about $x$, and vice versa. This apparent paradox results from equating a
uniform distribution with the idea of "no information."
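
A small numerical sketch makes the transformation effect concrete. It assumes a hypothetical case not taken from the text: $\xi$ uniform on (0, 1) and $f(\xi) = \xi^2$, so that $f^{-1}(x) = \sqrt{x}$. The density of $x$ is evaluated from the change-of-variable formula and integrated over two intervals of equal width.

```python
import math

def p_xi(xi):
    """Uniform prior density of xi on (0, 1)."""
    return 1.0 if 0.0 < xi < 1.0 else 0.0

def p_x(x):
    """Density of x = xi**2 from the change-of-variable formula.

    f^-1(x) = sqrt(x), and |d f^-1 / dx| = 0.5 / sqrt(x).
    """
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return p_xi(math.sqrt(x)) * abs(0.5 / math.sqrt(x))

def prob(a, b, n=100000):
    """Midpoint-rule integral of p_x over [a, b]."""
    h = (b - a) / n
    return sum(p_x(a + (i + 0.5) * h) for i in range(n)) * h

low = prob(0.0, 0.25)    # probability that x falls in the lowest quarter of (0, 1)
high = prob(0.75, 1.0)   # probability that x falls in the highest quarter of (0, 1)
print("P(0 < x < 0.25) =", low)
print("P(0.75 < x < 1) =", high)
```

Two intervals of equal width in $x$ carry quite different probabilities, even though every interval of equal width in $\xi$ is equally probable; the uniform prior on $\xi$ is therefore an informative prior on $x$.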

Therefore, although we can formally derive the equations for maximum likelihood estimators by
substituting uniform prior distributions in the equations for MAP estimators, we must avoid
misinterpretations. Fisher (1921, p. 326) discussed this subject at length:

There would be no need to emphasize the baseless character of the assumptions
made under the titles of inverse probability and BAYES' Theorem in view of
the decisive criticism to which they have been exposed. ...I must indeed plead
guilty in my original statement of the Method of Maximum Likelihood (9) to
having based my argument upon the principle of inverse probability; in the
same paper, it is true, I emphasized the fact that such inverse probabilities
were relative only. That is to say, that while one might speak of one value
of p as having an inverse probability three times that of another value of p,
we might on no account introduce the differential element dp, so as to be
able to say that it was three times as probable that p should lie in one
rather than the other of two equal elements. Upon consideration, therefore, I
perceive that the word probability is wrongly used in such a connection:
probability is a ratio of frequencies, and about the frequencies of such values
we can know nothing whatever. We must return to the actual fact that one value
of p, of the frequency of which we know nothing, would yield the observed
result three times as frequently as would another value of p. If we need a
word to characterize this relative property of different values of p, I suggest
that we may speak without confusion of the likelihood of one value of p
being thrice the likelihood of another, bearing always in mind that likelihood
is not here used loosely as a synonym of probability but simply to
express the relative frequencies with which such values of the hypothetical
quantity p would in fact yield the observed sample.

CHAPTER 5


5.0  THE  STATIC  ESTIMATION  PROBLEM 

This chapter begins the application of the general types of estimators defined in Chapter 4 to
specific problems. The problems discussed in this chapter are static estimation problems; that is,
problems where time is not explicitly involved. Subsequent chapters on dynamic systems draw heavily
on these static results. Our treatment is far from complete; it is easy to spend an entire book on
static estimation alone (Sorenson, 1980). The material presented here was selected largely on the
basis of relevance to dynamic systems.

We concentrate primarily on linear systems with additive Gaussian noise, where there are simple,
closed-form solutions. We also cover nonlinear systems with additive Gaussian noise, which will
prove of major importance in Chapter 8. Non-Gaussian and nonadditive noise are mentioned only
briefly, except for the special problem of estimation of variance.

We will initially treat nonsingular problems, where we assume that all relevant distributions have
density functions. The understanding and handling of singular and ill-conditioned problems then
receive special attention. Singularities and ill-conditioning are crucial issues in practical
application, but are insufficiently treated in much of the current literature. We also discuss
partitioning of estimation problems, an important technique for simplifying the computational task
and treating some singularities.

The general form of a static system model is

$$ Z = Z(\xi, U, \omega) \tag{5.0-1} $$

We apply a known specific input $U$ (or a set of inputs) to the system, and measure the response
$Z$. The vector $\omega$ is a random vector contaminating the measured system response. We desire to
estimate the value of $\xi$.

The estimators discussed in Chapter 4 require knowledge of the conditional distribution of $Z$
given $\xi$ and $U$. We assume, for now, that the distribution is nonsingular, with density
$p(Z|\xi,U)$. If $\xi$ is considered random, you must know the joint density $p(Z,\xi|U)$. In some
simple cases, these densities might be given directly, in which case Equation (5.0-1) is not
necessary; the estimators of Chapter 4 apply directly. More typically, $p(Z|\xi,U)$ is a complicated
density which is derived from Equation (5.0-1) and $p(\omega|\xi,U)$. It is often reasonable to
assume quite simple distributions for $\omega$, independent of $\xi$ and $U$. In this chapter, we
will look at several specific cases.


5.1  LINEAR  SYSTEMS  WITH  ADDITIVE  GAUSSIAN  NOISE 

The simplest and most classic results are obtained for linear static systems with additive Gaussian
noise. The system equations are assumed to have the form

$$ Z = C(U)\xi + D(U) + G(U)\omega \tag{5.1-1} $$

For any particular $U$, $Z$ is a linear combination of $\xi$, $\omega$, and a constant vector. Note
that there are no assumptions about linearity with respect to $U$; the functions $C$, $D$, and $G$
can be arbitrarily complicated. Throughout this section, we omit the explicit dependence on $U$ from
the notation. Similarly, all distributions and expectations are implicitly understood to be
conditioned on $U$.

The random noise vector $\omega$ is assumed to be Gaussian and independent of $\xi$. By convention,
we will define the mean of $\omega$ to be 0, and the covariance to be identity. This convention does
not limit the generality of Equation (5.1-1), for if $\omega$ has a mean $m$ and a finite covariance
$FF^*$, we can write $\omega = m + F\omega_2$ and define $G_2 = GF$ and $D_2 = D + Gm$ to obtain

$$ Z = C\xi + D_2 + G_2\omega_2 \tag{5.1-2} $$

where $\omega_2$ has zero mean and identity covariance.

When $\xi$ is considered as random, we will assume that its marginal (prior) distribution is
Gaussian with mean $m_\xi$ and covariance $P$.

$$ p(\xi) = |2\pi P|^{-1/2} \exp\left\{ -\tfrac{1}{2} (\xi - m_\xi)^* P^{-1} (\xi - m_\xi) \right\} \tag{5.1-3} $$

Equation (5.1-3) assumes that $P$ is nonsingular. We will discuss the implications and handling of
singular cases later.

5.1.1 Joint Distribution of Z and $\xi$

Several distributions which can be derived from Equation (5.1-1) will be required in order to
analyze this system. Let us first consider $p(Z|\xi)$, the conditional density of $Z$ given $\xi$.
This distribution is defined whether $\xi$ is random or not. If $\xi$ is given, then Equation
(5.1-1) is simply the sum of a constant vector and a constant matrix times a Gaussian vector. Using
the properties of Gaussian distributions discussed in Chapter 3, we see that the conditional
distribution of $Z$ given $\xi$ is Gaussian with mean and covariance

$$ E\{Z|\xi\} = C\xi + D \tag{5.1-4} $$

$$ \mathrm{cov}\{Z|\xi\} = GG^* \tag{5.1-5} $$

Thus, assuming that $GG^*$ is nonsingular,

$$ p(Z|\xi) = |2\pi GG^*|^{-1/2} \exp\left\{ -\tfrac{1}{2} (Z - C\xi - D)^* (GG^*)^{-1} (Z - C\xi - D) \right\} \tag{5.1-6} $$

If $\xi$ is random, with marginal density given by Equation (5.1-3), we can also meaningfully define
the joint distribution of $Z$ and $\xi$, the conditional distribution of $\xi$ given $Z$, and the
marginal distribution of $Z$.

For the marginal distribution of $Z$, note that Equation (5.1-1) is a linear combination of
independent Gaussian vectors. Therefore $Z$ is Gaussian with mean and covariance

$$ E\{Z\} = Cm_\xi + D \tag{5.1-7} $$

$$ \mathrm{cov}(Z) = CPC^* + GG^* \tag{5.1-8} $$

For the joint distribution of $\xi$ and $Z$, we now require the cross-correlation

$$ E\{[Z - E\{Z\}][\xi - E\{\xi\}]^*\} = CP \tag{5.1-9} $$

The joint distribution of $\xi$ and $Z$ is thus Gaussian with mean and covariance

$$ E\left\{ \begin{bmatrix} \xi \\ Z \end{bmatrix} \right\} = \begin{bmatrix} m_\xi \\ Cm_\xi + D \end{bmatrix} \tag{5.1-10} $$

$$ \mathrm{cov}\begin{bmatrix} \xi \\ Z \end{bmatrix} = \begin{bmatrix} P & PC^* \\ CP & CPC^* + GG^* \end{bmatrix} \tag{5.1-11} $$

Note that this joint distribution could also be derived by multiplying Equations (5.1-3) and (5.1-6)
according to Bayes' rule. That derivation arrives at the same results for Equations (5.1-10) and
(5.1-11), but is much more tedious.

Finally, we can derive the conditional distribution of $\xi$ given $Z$ (the posterior distribution
of $\xi$) from the joint distribution of $\xi$ and $Z$. Applying Theorem (3.5-9) to Equations
(5.1-10) and (5.1-11), we see that the conditional distribution of $\xi$ given $Z$ is Gaussian with
mean and covariance

$$ E\{\xi|Z\} = m_\xi + PC^*(CPC^* + GG^*)^{-1}(Z - Cm_\xi - D) \tag{5.1-12} $$

$$ \mathrm{cov}(\xi|Z) = P - PC^*(CPC^* + GG^*)^{-1}CP \tag{5.1-13} $$

Equations (5.1-12) and (5.1-13) assume that $CPC^* + GG^*$ is nonsingular. If this matrix is
singular, the problem is ill-posed and should be restated. We will discuss the singular case later.

Assuming that $P$, $GG^*$, and $(C^*(GG^*)^{-1}C + P^{-1})$ are nonsingular, we can use the matrix
inversion lemmas (Lemmas (A.1-3) and (A.1-4)) to put Equations (5.1-12) and (5.1-13) into forms that
will prove intuitively useful.

$$ E\{\xi|Z\} = m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}(Z - Cm_\xi - D) \tag{5.1-14} $$

$$ \mathrm{cov}(\xi|Z) = (C^*(GG^*)^{-1}C + P^{-1})^{-1} \tag{5.1-15} $$

We will have much occasion to contrast the form of Equations (5.1-12) and (5.1-13) with the form of
Equations (5.1-14) and (5.1-15). We will call Equations (5.1-12) and (5.1-13) the covariance form
because they are in terms of the uninverted covariances $P$ and $GG^*$. Equations (5.1-14) and
(5.1-15) are called the information form because they are in terms of the inverses $P^{-1}$ and
$(GG^*)^{-1}$, which are related to the amount of information. (The larger the covariance, the less
information you have, and vice versa.) Equation (5.1-15) has an interpretation as addition of
information: $P^{-1}$ is the amount of prior information about $\xi$, and $C^*(GG^*)^{-1}C$ is the
amount of information in the measurement; the total information after the measurement is the sum of
these two terms.
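
The equivalence of the covariance form and the information form is easy to check numerically. The following sketch uses NumPy with arbitrary, randomly generated matrices; the dimensions, the seed, and all variable names are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_param, n_meas = 3, 5

# Hypothetical system matrices and prior (any nonsingular choice would do).
C = rng.standard_normal((n_meas, n_param))
D = rng.standard_normal(n_meas)
G = rng.standard_normal((n_meas, n_meas)) + 3.0 * np.eye(n_meas)
A = rng.standard_normal((n_param, n_param))
P = A @ A.T + np.eye(n_param)          # symmetric positive definite prior covariance
m_xi = rng.standard_normal(n_param)
Z = rng.standard_normal(n_meas)        # an arbitrary measurement

GG = G @ G.T
resid = Z - C @ m_xi - D

# Covariance form: posterior mean and covariance from P and GG* directly.
S = C @ P @ C.T + GG
mean_cov = m_xi + P @ C.T @ np.linalg.solve(S, resid)
cov_cov = P - P @ C.T @ np.linalg.solve(S, C @ P)

# Information form: posterior mean and covariance from the inverses.
info = C.T @ np.linalg.solve(GG, C) + np.linalg.inv(P)
mean_info = m_xi + np.linalg.solve(info, C.T @ np.linalg.solve(GG, resid))
cov_info = np.linalg.inv(info)

print("means agree:", np.allclose(mean_cov, mean_info))
print("covariances agree:", np.allclose(cov_cov, cov_info))
```

Note that the covariance form inverts a matrix of the size of the measurement vector, whereas the information form inverts matrices of the size of the parameter vector (plus $GG^*$); which is cheaper depends on the problem dimensions.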

5.1.2  A  Posteriori  Estimators 

Let us first examine the three types of estimators that are based on the posterior distribution
$p(\xi|Z)$. These three types of estimators are a posteriori expected value, maximum a posteriori
probability, and Bayesian minimum risk.

We previously derived the expression for the a posteriori expected value in the process of defining
the posterior distribution. Either the covariance or information form can be used. We will use the
information form because it ties in with other approaches as will be seen below. The a posteriori
expected value estimator is thus

$$ \hat{\xi} = m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}(Z - Cm_\xi - D) \tag{5.1-16} $$

The maximum a posteriori probability estimate is equal to the a posteriori expected value because
the posterior distribution is Gaussian (and thus unimodal and symmetric about its mean). This fact
suggests an alternate derivation of Equation (5.1-16) which is quite enlightening. To find the
maximum point of the posterior distribution of $\xi$ given $Z$, write

$$ \ln p(\xi|Z) = \ln p(Z|\xi) + \ln p(\xi) - \ln p(Z) \tag{5.1-17} $$

Expanding this equation using Equations (5.1-3) and (5.1-6) gives

$$ \ln p(\xi|Z) = -\tfrac{1}{2} (Z - C\xi - D)^*(GG^*)^{-1}(Z - C\xi - D) - \tfrac{1}{2} (\xi - m_\xi)^* P^{-1} (\xi - m_\xi) + a(Z) \tag{5.1-18} $$

where $a(Z)$ is a function of $Z$ only. Equation (5.1-18) shows the problem in its "least squares"
form. We are attempting to choose $\xi$ to minimize $(\xi - m_\xi)$ and $(Z - C\xi - D)$. The
matrices $P^{-1}$ and $(GG^*)^{-1}$ are weightings used in the cost functions. The larger the value
of $(GG^*)^{-1}$, the more importance is placed on minimizing $(Z - C\xi - D)$, and vice versa.

Obtain the estimate $\hat{\xi}$ by setting the gradient of Equation (5.1-18) to zero, as suggested
by Equation (3.5-17).

$$ 0 = C^*(GG^*)^{-1}(Z - C\hat{\xi} - D) - P^{-1}(\hat{\xi} - m_\xi) \tag{5.1-19} $$

Write this as

$$ 0 = C^*(GG^*)^{-1}(Z - Cm_\xi - D) - P^{-1}(\hat{\xi} - m_\xi) - C^*(GG^*)^{-1}C(\hat{\xi} - m_\xi) \tag{5.1-20} $$

and the solution is

$$ \hat{\xi} = m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}(Z - Cm_\xi - D) \tag{5.1-21} $$

assuming that the inverses exist. For Gaussian distributions, Equation (3.5-18) gives the covariance
as

$$ \mathrm{cov}(\xi|Z) = -[\nabla_\xi^2 \ln p(\xi|Z)]^{-1} = (C^*(GG^*)^{-1}C + P^{-1})^{-1} \tag{5.1-22} $$

Note that the second gradient is negative definite (and the covariance positive definite), verifying
that the solution is a maximum of the posterior probability density function. This derivation does
not require the use of matrix inversion lemmas, or the expression from Chapter 3 for the Gaussian
conditional distribution. For more complicated problems, such as conditional distributions of N
jointly Gaussian vectors, the alternate derivation as in Equations (5.1-17) to (5.1-22) is much
easier than the straightforward derivation as in Equations (5.1-10) to (5.1-15).
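
The "least squares" reading of the log posterior can be made concrete with a short NumPy sketch. The matrices below are hypothetical values chosen only for illustration. Both quadratic terms are whitened with Cholesky factors and stacked into one ordinary least-squares problem; its solution is compared with the closed-form MAP estimate, and the gradient condition is checked at that point.

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.standard_normal((4, 2))                # hypothetical system matrix
D = rng.standard_normal(4)
GG = np.diag([0.5, 1.0, 2.0, 0.25])            # measurement noise covariance GG*
P = np.array([[2.0, 0.3], [0.3, 1.0]])         # prior covariance
m_xi = np.array([1.0, -1.0])                   # prior mean
Z = rng.standard_normal(4)                     # an arbitrary measurement

# Closed-form MAP estimate in information form.
info = C.T @ np.linalg.solve(GG, C) + np.linalg.inv(P)
xi_closed = m_xi + np.linalg.solve(info, C.T @ np.linalg.solve(GG, Z - C @ m_xi - D))

# Stacked least squares: minimize |Lg^-1 (Z - D - C xi)|^2 + |Lp^-1 (xi - m_xi)|^2,
# where GG* = Lg Lg^T and P = Lp Lp^T.
Lg = np.linalg.cholesky(GG)
Lp = np.linalg.cholesky(P)
A = np.vstack([np.linalg.solve(Lg, C), np.linalg.solve(Lp, np.eye(2))])
b = np.concatenate([np.linalg.solve(Lg, Z - D), np.linalg.solve(Lp, m_xi)])
xi_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# Gradient of the log posterior evaluated at the closed-form estimate.
grad = C.T @ np.linalg.solve(GG, Z - C @ xi_closed - D) - np.linalg.solve(P, xi_closed - m_xi)

print("closed form:  ", xi_closed)
print("least squares:", xi_ls)
print("gradient norm at the estimate:", np.linalg.norm(grad))
```

In this formulation the prior simply acts as an extra set of "measurements" of $\xi$ with covariance $P$, which is one way to read the addition of information in the information form.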

Because of the symmetry of the posterior distribution, the Bayesian optimal estimate is also equal
to the a posteriori expected value estimate if the Bayes loss function meets the criteria of
Theorem (4.3-1).


We will now examine the statistical properties of the estimator given by Equation (5.1-16). Since
the estimator is a linear function of $Z$, the bias is easy to compute.

$$
\begin{aligned}
b(\xi) &= E\{\hat{\xi}|\xi\} - \xi \\
&= E\{m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}(Z - Cm_\xi - D) \,|\, \xi\} - \xi \\
&= m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}[E\{Z|\xi\} - Cm_\xi - D] - \xi \\
&= m_\xi + (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}(C\xi + D - Cm_\xi - D) - \xi \\
&= [I - (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}C](m_\xi - \xi)
\end{aligned}
\tag{5.1-23}
$$

The estimator is biased for all finite nonsingular $P$ and $GG^*$. The scalar case gives some
insight into this bias. If $\xi$ is scalar, the factor in brackets in Equation (5.1-23) lies between
0 and 1. As $GG^*$ decreases and/or $P$ increases, the factor approaches 0, as does the bias. In
this case, the estimator obtains less information from the initial guess of $\xi$ (which has large
covariance), and more information from the measurement (which has small covariance). If the
situation is reversed, $GG^*$ increasing and/or $P$ decreasing, the bias becomes larger. In this
case, the estimator shows an increasing predilection to ignore the measured response and to keep the
initial guess of $\xi$.
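
The scalar behavior described above can be tabulated directly. This is a minimal sketch with hypothetical scalar values $c$, $gg$ (for $GG^*$), and $p$ (for $P$); the function name is an assumption made for illustration.

```python
def bias_factor(c, gg, p):
    """Bracketed factor of the bias expression for scalar C = c, GG* = gg, P = p."""
    info_meas = c * c / gg                 # information in the measurement
    return 1.0 - info_meas / (info_meas + 1.0 / p)

# (gg, p) pairs: accurate measurement with vague prior, balanced, and the reverse.
cases = [(0.01, 100.0), (1.0, 1.0), (100.0, 0.01)]
factors = [bias_factor(1.0, gg, p) for gg, p in cases]
for (gg, p), factor in zip(cases, factors):
    print(f"GG* = {gg:7.2f}   P = {p:7.2f}   bias factor = {factor:.6f}")
```

The factor moves from near 0 (the estimate follows the measurement) to near 1 (the estimate stays at the prior mean) as the measurement noise grows relative to the prior covariance.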

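The variance and mean square error are also easy to compute. The variance of $\hat{\xi}$ follows
directly from Equations (5.1-16) and (5.1-5):

$$
\begin{aligned}
\mathrm{cov}(\hat{\xi}|\xi) &= (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1} GG^* (GG^*)^{-1} C (C^*(GG^*)^{-1}C + P^{-1})^{-1} \\
&= (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}C (C^*(GG^*)^{-1}C + P^{-1})^{-1}
\end{aligned}
\tag{5.1-24}
$$

The mean square error is then

$$ \mathrm{mse}(\hat{\xi}|\xi) = \mathrm{cov}(\hat{\xi}|\xi) + b(\xi)b(\xi)^* \tag{5.1-25} $$

which is evaluated using Equations (5.1-23) and (5.1-24).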


The most obvious question to ask in relation to Equations (5.1-24) and (5.1-25) is how they compare
with other estimators and with the Cramer-Rao bound. Let us evaluate the Cramer-Rao bound. The
Fisher information matrix (Equation (4.2-19)) is easy to compute using Equation (5.1-6):

$$
\begin{aligned}
M &= E\{C^*(GG^*)^{-1}(Z - C\xi - D)(Z - C\xi - D)^*(GG^*)^{-1}C\} \\
&= C^*(GG^*)^{-1} GG^* (GG^*)^{-1} C = C^*(GG^*)^{-1}C
\end{aligned}
\tag{5.1-26}
$$

Thus the Cramer-Rao bound for unbiased estimators is

$$ \mathrm{mse}(\hat{\xi}|\xi) \ge (C^*(GG^*)^{-1}C)^{-1} \tag{5.1-27} $$


Note that, for some values of $\xi$, the a posteriori expected value estimator has a lower
mean-square error than the Cramer-Rao bound for unbiased estimators; naturally, this is because the
estimator is biased. To compute the Cramer-Rao bound for an estimator with bias given by Equation
(5.1-23), we need to evaluate

$$
\begin{aligned}
I + \nabla_\xi b(\xi) &= I + (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}C - I \\
&= (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}C
\end{aligned}
\tag{5.1-28}
$$

The Cramer-Rao bound is then (from Equation (4.2-10))

$$ \mathrm{mse}(\hat{\xi}|\xi) \ge (C^*(GG^*)^{-1}C + P^{-1})^{-1} C^*(GG^*)^{-1}C (C^*(GG^*)^{-1}C + P^{-1})^{-1} \tag{5.1-29} $$

Note that the estimator does not achieve the Cramer-Rao bound except at the single point
$\xi = m_\xi$. At every other point, the second term in Equation (5.1-25) is positive, and the first
term is equal to the bound; therefore, the mse is greater than the bound.
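
A scalar sketch shows how the mean square error of the a posteriori estimator compares with the two bounds. The values $c = gg = p = 1$ and $m_\xi = 0$ are hypothetical choices for illustration, and the function names are assumptions.

```python
def map_mse(xi, m_xi=0.0, c=1.0, gg=1.0, p=1.0):
    """Mean square error of the scalar a posteriori estimator at true value xi."""
    info_meas = c * c / gg
    total = info_meas + 1.0 / p
    variance = info_meas / total ** 2                 # also the bound for this bias
    bias = (1.0 - info_meas / total) * (m_xi - xi)
    return variance + bias ** 2

def unbiased_bound(c=1.0, gg=1.0):
    """Cramer-Rao bound for unbiased estimators in the scalar case."""
    return gg / (c * c)

for xi in (0.0, 1.0, 2.0, 3.0):
    print(f"xi = {xi:3.1f}   mse = {map_mse(xi):.4f}   unbiased bound = {unbiased_bound():.4f}")
```

Near the prior mean the mse is below the bound for unbiased estimators; far from it the squared bias dominates and the mse exceeds that bound. In every case the mse is at least the variance term, which equals the bound for an estimator with this bias.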

For a single observation, we can say in summary that the a posteriori estimator is optimal Bayesian
for a large class of loss functions, but it is biased and does not achieve the Cramer-Rao lower
bound. It remains to investigate the asymptotic properties. The asymptotic behavior of estimators
for static systems is defined in terms of N independent repetitions of the experiment, where N
approaches infinity. We must first define the application of the a posteriori estimator to repeated
experiments.

Assume that the system model is given by Equation (5.1-1), with $\xi$ distributed according to
Equation (5.1-3). Perform N experiments $U_1 \ldots U_N$. (It does not matter whether the $U_i$ are
distinct.) The corresponding system matrices are $C_i$, $D_i$, and $G_iG_i^*$, and the measurements
are $Z_i$. The random noise $\omega_i$ is an independent, zero-mean, identity covariance, Gaussian
vector for each $i$. The maximum a posteriori estimate of $\xi$ is given by

$$ \hat{\xi} = m_\xi + \left[ \sum_{i=1}^{N} C_i^*(G_iG_i^*)^{-1}C_i + P^{-1} \right]^{-1} \sum_{i=1}^{N} C_i^*(G_iG_i^*)^{-1}(Z_i - C_im_\xi - D_i) \tag{5.1-30} $$

assuming that the inverses exist.


The asymptotic properties are defined for repetition of the same experiment, so we do not need the
full generality of Equation (5.1-30). If $U_i = U_j$, $C_i = C_j$, $D_i = D_j$, and $G_i = G_j$ for
all $i$ and $j$, Equation (5.1-30) can be written

$$ \hat{\xi} = m_\xi + [NC^*(GG^*)^{-1}C + P^{-1}]^{-1} C^*(GG^*)^{-1} \sum_{i=1}^{N} (Z_i - Cm_\xi - D) \tag{5.1-31} $$

Compute the bias, covariance, and mse of this estimate in the same manner as Equations (5.1-23)
to (5.1-25):

$$ b(\xi) = [I - (NC^*(GG^*)^{-1}C + P^{-1})^{-1} NC^*(GG^*)^{-1}C](m_\xi - \xi) \tag{5.1-32} $$

$$ \mathrm{cov}(\hat{\xi}|\xi) = [NC^*(GG^*)^{-1}C + P^{-1}]^{-1} NC^*(GG^*)^{-1}C [NC^*(GG^*)^{-1}C + P^{-1}]^{-1} \tag{5.1-33} $$

$$ \mathrm{mse}(\hat{\xi}|\xi) = \mathrm{cov}(\hat{\xi}|\xi) + b(\xi)b(\xi)^* \tag{5.1-34} $$

The Cramer-Rao bound for unbiased estimators is

$$ \mathrm{mse}(\hat{\xi}|\xi) \ge (NC^*(GG^*)^{-1}C)^{-1} \tag{5.1-35} $$

As N increases, Equation (5.1-32) goes to zero, so the estimator is asymptotically unbiased. The
effect of increasing N is exactly comparable to increasing $(GG^*)^{-1}$; as we take more and better
quality measurements, the estimator depends more heavily on the measurements and less on its initial
guess.

The estimator is also asymptotically efficient as defined by Equation (4.2-28) because

$$ NC^*(GG^*)^{-1}C \; \mathrm{cov}(\hat{\xi}|\xi) \to I \quad \text{as } N \to \infty \tag{5.1-36} $$

(5.1-37)
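
The asymptotic behavior can be observed numerically by evaluating the repeated-experiment bias factor and covariance for increasing N. The sketch below uses NumPy with hypothetical matrices $C$, $GG^*$, and $P$ chosen only for illustration.

```python
import numpy as np

C = np.array([[1.0, 0.5], [0.0, 1.0], [1.0, -1.0]])   # hypothetical system matrix
GG = np.diag([1.0, 2.0, 0.5])                          # measurement noise covariance GG*
P = np.array([[1.0, 0.2], [0.2, 0.5]])                 # prior covariance
M1 = C.T @ np.linalg.solve(GG, C)                      # information from one experiment
P_inv = np.linalg.inv(P)
I2 = np.eye(2)

def asymptotic_terms(N):
    """Return the bias factor and N C*(GG*)^-1 C cov for N repeated experiments."""
    total_inv = np.linalg.inv(N * M1 + P_inv)
    bias_factor = I2 - total_inv @ (N * M1)            # bracket multiplying (m_xi - xi)
    cov = total_inv @ (N * M1) @ total_inv             # covariance of the estimate
    return bias_factor, (N * M1) @ cov                 # second item should tend to I

for N in (1, 10, 100, 10000):
    bf, eff = asymptotic_terms(N)
    print(f"N = {N:6d}   |bias factor| = {np.linalg.norm(bf):.2e}   "
          f"|N M cov - I| = {np.linalg.norm(eff - I2):.2e}")
```

Both printed norms shrink roughly in proportion to 1/N, since the prior information $P^{-1}$ stays fixed while the measurement information grows linearly with N.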

5.1.3 Maximum Likelihood Estimator

The derivation of the expression for the maximum likelihood estimator is similar to the derivation
of the maximum a posteriori probability estimator done in Equations (5.1-17) to (5.1-22). The only
difference is that instead of $\ln p(\xi|Z)$, we maximize

$$ \ln p(Z|\xi) = -\tfrac{1}{2} (Z - C\xi - D)^*(GG^*)^{-1}(Z - C\xi - D) + a(Z) \tag{5.1-38} $$

The only relevant difference between Equation (5.1-38) and Equation (5.1-18) is the inclusion of the
term based on the prior distribution of $\xi$ in Equation (5.1-18). (The $a(Z)$ are also different,
but this is of no consequence at the moment.) The maximum likelihood estimate does not make use of
the prior distribution; indeed it does not require that such a distribution exist. We will see that
many of the MLE results are equal to the MAP results with the terms from the prior distribution
omitted.

Find the maximum point of Equation (5.1-38) by setting the gradient to zero.

$$ 0 = C^*(GG^*)^{-1}(Z - C\hat{\xi} - D) \tag{5.1-39} $$

The solution, assuming that $C^*(GG^*)^{-1}C$ is nonsingular, is given by

$$ \hat{\xi} = (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}(Z - D) \tag{5.1-40} $$

This is the same form as that of the MAP estimate, Equation (5.1-21), with $P^{-1}$ set to zero.

A particularly simple case occurs when $C = I$ and $D = 0$. In this event, Equation (5.1-40) reduces
to $\hat{\xi} = Z$.

Note that the expression $(C^*(GG^*)^{-1}C)^{-1}C^*(GG^*)^{-1}$ is a left-inverse of $C$; that is

$$ [(C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}]C = I \tag{5.1-41} $$
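
The left-inverse property is simple to verify numerically. The following NumPy sketch uses hypothetical values for $C$, $D$, and $G$; the names `K` and `mle` are assumptions introduced only for this illustration. It also applies the estimator to a measurement generated without noise.

```python
import numpy as np

C = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])   # hypothetical system matrix
D = np.array([0.5, -0.5, 1.0])
G = np.diag([1.0, 0.5, 2.0])
GG_inv = np.linalg.inv(G @ G.T)

# K = (C*(GG*)^-1 C)^-1 C*(GG*)^-1, the matrix applied to (Z - D) by the estimator.
K = np.linalg.solve(C.T @ GG_inv @ C, C.T @ GG_inv)

def mle(Z):
    """Maximum likelihood estimate for the linear Gaussian model."""
    return K @ (Z - D)

xi_true = np.array([2.0, -1.0])
Z_noise_free = C @ xi_true + D          # measurement with the noise set to zero

print("K C =\n", K @ C)
print("estimate from noise-free data:", mle(Z_noise_free))
```

Because `K @ C` is the identity, a noise-free measurement is inverted exactly, whatever weighting $(GG^*)^{-1}$ is used.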

We can view the estimator given by Equation (5.1-40) as a pseudo-inverse of the system given by
Equation (5.1-1). Using both equations, write

$$
\begin{aligned}
\hat{\xi} &= (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}(C\xi + D + G\omega - D) \\
&= \xi + (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}G\omega \\
&= \xi + (C^*(GG^*)^{-1}C)^{-1} C^*(G^*)^{-1}\omega
\end{aligned}
\tag{5.1-42}
$$

Although we must use Equation (5.1-40) to compute $\hat{\xi}$ because $\xi$ and $\omega$ are not
known, Equation (5.1-42) is useful in analyzing and understanding the behavior of the estimator. One
interesting point is immediately obvious from Equation (5.1-42): the estimate is simply the sum of
the true value plus the effect of the contaminating noise $\omega$. For the particular realization
$\omega = 0$, the estimate is exactly equal to the true value. This property, which is not shared by
the a posteriori estimators, is closely related to the bias. Indeed, the bias of the maximum
likelihood estimator is immediately evident from Equation (5.1-42).

$$ b(\xi) = E\{\hat{\xi}|\xi\} - \xi = 0 \tag{5.1-43} $$

The maximum likelihood estimate is thus unbiased. Note that Equation (5.1-32) for the MAP bias gives
the same result if we substitute 0 for $P^{-1}$.

Since the estimator is unbiased, the covariance and mean square error are equal. Using Equation
(5.1-42), they are given by

$$
\begin{aligned}
\mathrm{cov}(\hat{\xi}|\xi) = \mathrm{mse}(\hat{\xi}|\xi) &= (C^*(GG^*)^{-1}C)^{-1} C^*(G^*)^{-1} G^{-1} C (C^*(GG^*)^{-1}C)^{-1} \\
&= (C^*(GG^*)^{-1}C)^{-1}
\end{aligned}
\tag{5.1-44}
$$

We can also obtain this result from Equations (5.1-33) and (5.1-34) for the MAP estimator by
substituting 0 for $P^{-1}$.


We previously computed the Cramer-Rao bound for unbiased estimators for this problem (Equation
(5.1-27)). The mean square error of the maximum likelihood estimator is exactly equal to the
Cramer-Rao bound. The maximum likelihood estimator is thus efficient and is, therefore, a minimum
variance unbiased estimator. The maximum likelihood estimator is not, in general, Bayesian optimal;
Bayesian optimality may not even be defined, since $\xi$ need not be random.
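
A Monte Carlo experiment gives an empirical view of these properties. The sketch below is only illustrative: the matrices, the true parameter value, the seed, and the number of trials are all hypothetical choices, and the sample statistics agree with the theory only to within sampling error.

```python
import numpy as np

rng = np.random.default_rng(42)
C = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])   # hypothetical system matrix
D = np.array([0.5, -0.5, 1.0])
G = np.diag([1.0, 0.5, 2.0])
GG_inv = np.linalg.inv(G @ G.T)
M = C.T @ GG_inv @ C                                  # Fisher information matrix
K = np.linalg.solve(M, C.T @ GG_inv)                  # maximum likelihood estimator gain
xi_true = np.array([2.0, -1.0])

n_trials = 200_000
omega = rng.standard_normal((n_trials, 3))            # zero-mean, identity-covariance noise
Z = xi_true @ C.T + D + omega @ G.T                   # one simulated measurement per row
estimates = (Z - D) @ K.T                             # maximum likelihood estimate per row

sample_bias = estimates.mean(axis=0) - xi_true
sample_cov = np.cov(estimates, rowvar=False)
bound = np.linalg.inv(M)                              # Cramer-Rao bound, unbiased estimators

print("sample bias:", sample_bias)
print("sample covariance:\n", sample_cov)
print("Cramer-Rao bound:\n", bound)
```

The sample bias should be close to zero and the sample covariance close to the bound, with the discrepancies shrinking as the number of trials increases.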

The MLE results for repeated experiments can be obtained from the corresponding MAP equations by
substituting zero for $P^{-1}$ and $m_\xi$. We will not repeat these equations here.

5.1.4  Comparison  of  Estimators 

We have seen that the maximum likelihood estimator is unbiased and efficient, whereas the
a posteriori estimators are only asymptotically unbiased and efficient. On the other hand, the
a posteriori estimators are Bayesian optimal for a large class of loss functions. Thus neither
estimator emerges as an unchallenged favorite. The reader might reasonably expect some guidance as
to which estimator to choose for a given problem.

The roles of the two estimators are actually quite distinct and well-defined. The maximum likelihood
estimator does the best possible job (in the sense of minimum mean square error) of estimating the value of
$\xi$ based on the measurements alone, without prejudice (bias) from any preconceived guess about the value.
The maximum likelihood estimator is thus the obvious choice when we have no prior information. Having no prior
information is analogous to having a prior distribution with infinite variance; i.e., $P^{-1} = 0$. In this
regard, examine Equation (5.1-16) for the a posteriori estimate as $P^{-1}$ goes to zero. The limit is
(assuming that $C^*(GG^*)^{-1}C$ is nonsingular)

$$\begin{aligned}
\hat{\xi} &= m_\xi + (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}(Z - Cm_\xi - D) \\
  &= m_\xi - (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}C m_\xi + (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}(Z - D) \\
  &= (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}(Z - D)
\end{aligned} \tag{5.1-45}$$

which is equal to the maximum likelihood estimate. The maximum likelihood estimate is thus a limiting case
of an a posteriori estimator as the variance of the prior distribution approaches infinity.
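
The limiting behavior can be illustrated numerically. This is a sketch under assumed data: the matrices, the
measurement, and the deliberately poor prior mean are all hypothetical, and the prior is taken as
$P^{-1} = \epsilon I$ with $\epsilon$ shrinking toward zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical problem data, chosen only for illustration.
C = rng.normal(size=(4, 2))
D = rng.normal(size=4)
G = np.diag([0.5, 1.0, 1.5, 2.0])
Z = rng.normal(size=4)
m_xi = np.array([10.0, -10.0])       # deliberately poor prior mean

R_inv = np.linalg.inv(G @ G.T)
M = C.T @ R_inv @ C                  # C*(GG*)^-1 C, assumed nonsingular

def map_estimate(P_inv):
    """Information form of the a posteriori estimate for a given P^-1."""
    return m_xi + np.linalg.solve(M + P_inv, C.T @ R_inv @ (Z - C @ m_xi - D))

# Maximum likelihood estimate, the last line of Equation (5.1-45).
mle = np.linalg.solve(M, C.T @ R_inv @ (Z - D))

# Distance between the a posteriori and maximum likelihood estimates as P^-1 -> 0.
gaps = [np.linalg.norm(map_estimate(eps * np.eye(2)) - mle)
        for eps in (1.0, 1e-3, 1e-6, 0.0)]
print(gaps)
```

The gap shrinks as the prior becomes less informative and vanishes (to rounding error) at $P^{-1} = 0$, even
though the prior mean is far from the data.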

The a posteriori estimate combines the information from the measurements with the prior information to
obtain the optimal estimate considering both sources. This estimator makes use of more information and thus
can obtain more accurate estimates, on the average. With this improved average accuracy comes a bias in favor
of the prior estimate. If the prior estimate is good, the a posteriori estimate will generally be more
accurate than the maximum likelihood estimate. If the prior estimate is poor, the a posteriori estimate will
be poor. The advantages of the a posteriori estimators thus depend heavily on the accuracy of the prior
estimate of the value.

The basic criterion in deciding whether to use an MAP or MLE estimator is whether you want estimates based
only on the current data or based on both the current data and the prior information. The MLE estimate is
based only on the current data, and the MAP estimate is based on both the current data and the prior
distribution.

The distinction between the MLE and MAP estimators often becomes blurred in practical application. The
estimators are closely related in numerical computation, as well as in theory. An MAP estimate can be an
intermediate computational step to obtaining a final MLE estimate, or vice versa. The following paragraphs
describe one of these situations; the other situation is discussed in Section 5.2.2.

It is quite common to have a prior guess of the parameters, but to desire an independent verification of
the value based on the measurements alone. In this case, the maximum likelihood estimator is the appropriate
tool in order to make the estimates independent of the initial guess.

A two-step estimation is often the most appropriate to obtain maximum insight into a problem. First, use
the maximum likelihood estimator to obtain the best estimates based on the measurements alone, ignoring any
prior information. Then consider the prior information in order to obtain a final best estimate based on both
the measurements and the prior information. By this two-step approach, we can see where the information is
coming from: the prior distribution, the measurements, or both sources. The two-step approach also allows the
freedom to independently choose the methodology for each step. For instance, we might desire to use a maximum
likelihood estimator for obtaining the estimates based on the measurements, but use engineering judgment to
establish the best compromise between the prior expectations and the maximum likelihood results. This is often
the best approach because it may be difficult to completely and accurately characterize the prior information
in terms of a specific probability distribution. The prior information often includes heuristic factors such
as the engineer's judgment of what would constitute reasonable results.

The theory of sufficient statistics (Ferguson, 1967; Cramer, 1946; and Fisher, 1921) is useful in the
two-step approach if we desire to use statistical techniques for both steps. The maximum likelihood estimate
and its covariance form a sufficient statistic for this problem. Although we will not go into detail here,
if we know the maximum likelihood estimate and its covariance, we know all of the statistically useful
information that can be extracted from the data. The specific application is that the a posteriori estimates
can be written in terms of the maximum likelihood estimate and its covariance instead of as a direct function
of the data. The following expression is easy to verify using Equations (5.1-16), (5.1-40), and (5.1-44):

$$\hat{\xi}_a = m_\xi + (Q^{-1} + P^{-1})^{-1} Q^{-1} (\hat{\xi}_{ML} - m_\xi) \tag{5.1-46}$$

where $\hat{\xi}_a$ is the a posteriori estimate (Equation (5.1-16)), $\hat{\xi}_{ML}$ is the maximum
likelihood estimate (Equation (5.1-40)), and $Q$ is the covariance of the maximum likelihood estimate
(Equation (5.1-44)). In this form, the relationship between the a posteriori estimate and the maximum
likelihood estimate is plain. The prior distribution is the only factor which enters into the relationship;
it has nothing directly to do with the measured data or even with what experiment was performed.
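
Equation (5.1-46) can be verified numerically. The sketch below assumes arbitrary illustrative data (none of
the matrices come from the text): it first summarizes the measurements by the maximum likelihood estimate and
its covariance, then combines that summary with the prior, and compares the result with the a posteriori
estimate computed directly from the data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical problem data, chosen only for illustration.
C = rng.normal(size=(6, 3))
D = rng.normal(size=6)
G = np.diag(rng.uniform(0.5, 2.0, size=6))
Z = rng.normal(size=6)
m_xi = rng.normal(size=3)
A = rng.normal(size=(3, 3))
P = A @ A.T + np.eye(3)              # symmetric positive definite prior covariance

R_inv = np.linalg.inv(G @ G.T)
P_inv = np.linalg.inv(P)

# Step 1: summarize the data by the MLE and its covariance Q.
Q = np.linalg.inv(C.T @ R_inv @ C)
xi_ml = Q @ C.T @ R_inv @ (Z - D)

# Step 2: combine that summary with the prior, Equation (5.1-46).
Q_inv = np.linalg.inv(Q)
xi_a_two_step = m_xi + np.linalg.solve(Q_inv + P_inv, Q_inv @ (xi_ml - m_xi))

# Reference: a posteriori estimate computed directly from the data (information form).
xi_a_direct = m_xi + np.linalg.solve(
    C.T @ R_inv @ C + P_inv, C.T @ R_inv @ (Z - C @ m_xi - D))

print(np.allclose(xi_a_two_step, xi_a_direct))
```

Note that step 2 never touches `Z`, `C`, `D`, or `G`; only `xi_ml` and `Q` carry the information from the
experiment.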

Equation (5.1-46) is closely related to the measurement-partitioning ideas of the next section. Both
relate to combining data from two different sources.


5.2  PARTITIONING  IN  ESTIMATION  PROBLEMS 

Partitioning estimation problems has some of the same benefits as partitioning optimization problems. A
problem half the size of the original typically takes well less than half the effort to solve. Therefore, we
can often come out ahead by partitioning a problem into smaller subproblems. Of course, this trick only works
if the solutions to the subproblems can easily be combined to give a solution to the original problem.

Two kinds of partitioning applicable to parameter estimation problems are measurement partitioning and
parameter partitioning. Both of these schemes permit easy combination of the subproblem solutions in some
situations.

5.2.1  Measurement  Partitioning 

A problem with multiple measurements can often be partitioned into a sequence of subproblems processing
the measurements one at a time. The same principle applies to partitioning a vector measurement into a series
of scalar (or shorter vector) measurements; the only difference is notational.

The estimators under discussion are all based on $p(Z \mid \xi)$ or, for a posteriori estimators,
$p(\xi \mid Z)$. We will initially consider measurement partitioning as a problem in factoring these density
functions.

Let the measurement $Z$ be partitioned into two measurements, $Z_1$ and $Z_2$. (Extensions to more than two
partitions follow the same principles.) We would like to factor $p(Z \mid \xi)$ into separate factors
dependent on $Z_1$ and $Z_2$. By Bayes' rule, we can always write

$$p(Z \mid \xi) = p(Z_2 \mid Z_1, \xi)\, p(Z_1 \mid \xi) \tag{5.2-1}$$

This form does not directly achieve the required separation because $p(Z_2 \mid Z_1, \xi)$ involves both
$Z_1$ and $Z_2$. To achieve the required separation, we introduce the requirement that

$$p(Z_2 \mid Z_1, \xi) = p(Z_2 \mid \xi) \tag{5.2-2}$$

We will call this the Markov criterion.

Heuristically, the Markov criterion assures that $p(Z_1 \mid \xi)$ contains all of the useful information we
can extract from $Z_1$. Therefore, having computed $p(Z_1 \mid \xi)$ at the measured value of $Z_1$, we have
no further need for $Z_1$. If the Markov criterion does not hold, then there are interactions that require
$Z_1$ and $Z_2$ to be considered together instead of separately. For systems with additive noise, the Markov
criterion implies that the noise in $Z_1$ is independent of that in $Z_2$. Note that this does not mean that
$Z_1$ is independent of $Z_2$.

For systems where the Markov criterion holds, we can substitute Equation (5.2-2) into Equation (5.2-1)
to get

$$p(Z \mid \xi) = p(Z_2 \mid \xi)\, p(Z_1 \mid \xi) \tag{5.2-3}$$

which is the desired factorization of $p(Z \mid \xi)$.
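
The role of noise independence in this factorization can be seen in a small numerical sketch. The scalar
system $Z_i = \xi + \text{noise}_i$, the parameter value, and the correlated noise covariance used here are
hypothetical choices for illustration; `gauss_logpdf` is a helper written for this sketch.

```python
import numpy as np

def gauss_logpdf(x, mean, cov):
    """Log of a multivariate Gaussian density with nonsingular covariance."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))

xi = 0.7                              # hypothetical parameter value
Z = np.array([2.0, -1.0])             # measured Z1 and Z2
mean = np.array([xi, xi])             # Z_i = xi + noise_i

def factorization_gap(noise_cov):
    """|ln p(Z|xi) - (ln p(Z1|xi) + ln p(Z2|xi))| for a given joint noise covariance."""
    joint = gauss_logpdf(Z, mean, noise_cov)
    separate = sum(gauss_logpdf(Z[i:i + 1], mean[i:i + 1], noise_cov[i:i + 1, i:i + 1])
                   for i in range(2))
    return abs(joint - separate)

gap_independent = factorization_gap(np.eye(2))
gap_correlated = factorization_gap(np.array([[1.0, 0.8], [0.8, 1.0]]))
print(gap_independent, gap_correlated)
```

With independent noise the joint density equals the product of the two factors, as in Equation (5.2-3). With
correlated noise the product of the marginal factors no longer reproduces the joint density, so the two
measurements cannot be processed separately in this simple way.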

When $\xi$ has a prior distribution, the factorization of $p(\xi \mid Z)$ follows from that of
$p(Z \mid \xi)$:

$$p(\xi \mid Z) = \frac{p(Z_2 \mid \xi)\, p(Z_1 \mid \xi)\, p(\xi)}{p(Z)} \tag{5.2-4}$$


The mixing of $Z_1$ and $Z_2$ in the $p(Z)$ in the denominator is not important, because the denominator is
merely a normalizing constant, independent of $\xi$. It will prove convenient to write Equation (5.2-4) in
the form

$$p(\xi \mid Z) = \frac{p(Z_2 \mid \xi)\, p(\xi \mid Z_1)}{p(Z_2 \mid Z_1)} \tag{5.2-5}$$


Let us now consider measurement partitioning of an MAP estimator for a system with $p(\xi \mid Z)$ factored
as in Equation (5.2-5). The MAP estimate is

$$\hat{\xi} = \arg\max_{\xi}\; p(Z_2 \mid \xi)\, p(\xi \mid Z_1) \tag{5.2-6}$$

This equation is identical in form to Equation (4.3-18), with $p(\xi \mid Z_1)$ playing the role of the prior
distribution. We have, therefore, the following two-step process for obtaining the MAP estimate by
measurement partitioning:

First, evaluate the posterior distribution of $\xi$ given $Z_1$. This is a function of $\xi$, rather than a
single value. Practical application demands that this distribution be easily representable by a few
statistics, but we put off such considerations until the next section. Then use this as the prior
distribution for an MAP estimator with the measurement $Z_2$. Provided that the system meets the Markov
criterion, the resulting estimate should be identical to that obtained by the unpartitioned MAP estimator.

Measurement partitioning of an MLE estimator follows similar lines, except for some issues of
interpretation. The MLE estimate for a system factored as in Equation (5.2-3) is

$$\hat{\xi} = \arg\max_{\xi}\; p(Z_2 \mid \xi)\, p(Z_1 \mid \xi) \tag{5.2-7}$$

This equation is identical in form to Equation (4.3-18), with $p(Z_1 \mid \xi)$ playing the role of the prior
distribution. The two steps of the partitioned MLE estimator are therefore as follows: first, evaluate
$p(Z_1 \mid \xi)$ at the measured value of $Z_1$, giving a function of $\xi$. Then use this function as the
prior density for an MAP estimator with measurement $Z_2$. Provided that the system meets the Markov
criterion, the resulting estimate should be identical to that obtained by the unpartitioned MLE estimator.

The partitioned MLE estimator raises an issue of interpretation of $p(Z_1 \mid \xi)$. It is not a
probability density function of $\xi$. The vector $\xi$ need not even be random. We can avoid the issue of
$\xi$ not being random by using information terminology, considering $p(Z_1 \mid \xi)$ to represent the state
of our knowledge of $\xi$ based on $Z_1$ instead of being a probability density function of $\xi$.
Alternately, we can simply consider $p(Z_1 \mid \xi)$ to be a function of $\xi$ that arises at an
intermediate step of computing the MLE estimate. The process described gives the correct MLE estimate of
$\xi$, regardless of how we choose to interpret the intermediate steps.

The close connection between MAP and MLE estimators is illustrated by the appearance of an MAP estimator
as a step in obtaining the MLE estimate with partitioned measurements. The result can be interpreted either as
an MAP estimate based on the measurement $Z_2$ and the prior density $p(Z_1 \mid \xi)$, or as an MLE estimate
based on both $Z_1$ and $Z_2$.

5.2.2  Application  to  Linear  Gaussian  Systems

We now consider the application of measurement partitioning to linear systems with additive Gaussian
noise. We will first consider the partitioned MAP estimator, followed by the partitioned MLE estimator.

Let the partitioned system be

$$Z_1 = C_1 \xi + D_1 + G_1 \omega_1 \tag{5.2-8a}$$

$$Z_2 = C_2 \xi + D_2 + G_2 \omega_2 \tag{5.2-8b}$$

where $\omega_1$ and $\omega_2$ are independent Gaussian random variables with mean 0 and covariance $I$. The
Markov criterion requires that $\omega_1$ and $\omega_2$ be independent for measurement partitioning to
apply. The prior distribution of $\xi$ is Gaussian with mean $m_\xi$ and covariance $P$, and is independent
of $\omega_1$ and $\omega_2$.

The first step of the partitioned MAP estimator is to compute $p(\xi \mid Z_1)$. We have previously seen
that this is a Gaussian density with mean and covariance given by Equations (5.1-12) and (5.1-13). Denote the
mean and covariance of $p(\xi \mid Z_1)$ by $m_1$ and $P_1$. Then, Equations (5.1-12) and (5.1-13) give

$$m_1 = m_\xi + P C_1^* (C_1 P C_1^* + G_1 G_1^*)^{-1} (Z_1 - C_1 m_\xi - D_1) \tag{5.2-9}$$

$$P_1 = P - P C_1^* (C_1 P C_1^* + G_1 G_1^*)^{-1} C_1 P \tag{5.2-10}$$

The second step is to compute the MAP estimate of $\xi$ using the measurement $Z_2$ and the prior density
$p(\xi \mid Z_1)$. This step is another application of Equation (5.1-12), using $m_1$ for $m_\xi$ and $P_1$
for $P$. The result is

$$\hat{\xi} = m_2 = m_1 + P_1 C_2^* (C_2 P_1 C_2^* + G_2 G_2^*)^{-1} (Z_2 - C_2 m_1 - D_2) \tag{5.2-11}$$

The $\hat{\xi}$ defined by Equation (5.2-11) is the MAP estimate. It should exactly equal the MAP estimate
obtained by direct application of Equation (5.1-12) to the concatenated system. You can consider
Equations (5.2-9) through (5.2-11) to be an algebraic rearrangement of the original Equation (5.1-12); indeed,
they can be derived in such terms.

Example 5.2-1  Consider a system

$$Z = \xi + \omega$$

where $\omega$ is Gaussian with mean 0 and covariance 1, and $\xi$ has a Gaussian
prior distribution with mean 0 and covariance 1. We make two independent
measurements of $Z$ (i.e., the two samples of $\omega$ are independent) and desire
the MAP estimate of $\xi$. Suppose the $Z_1$ measurement is 2 and the $Z_2$
measurement is -1.

Without measurement partitioning, we could proceed as follows: write the
concatenated system

$$\begin{bmatrix} Z_1 \\ Z_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \xi + \begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix}$$

Directly apply Equation (5.1-12) with $m_\xi = 0$, $P = 1$, $C = [1\ 1]^*$, $D = 0$,
$G = I$, and $Z = [2, -1]^*$. The MAP estimate is then

$$\hat{\xi} = [1\ 1] \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}^{-1} Z = \tfrac{1}{3}(Z_1 + Z_2) = \tfrac{1}{3}$$

Now consider this same problem with measurement partitioning. To get $p(\xi \mid Z_1)$,
apply Equations (5.2-9) and (5.2-10) with $m_\xi = 0$, $P = 1$, $C_1 = 1$, $D_1 = 0$,
$G_1 = 1$, and $Z_1 = 2$.

$$m_1 = 1(2)^{-1} Z_1 = \tfrac{1}{2} Z_1 = 1$$

$$P_1 = 1 - 1(2)^{-1} 1 = \tfrac{1}{2}$$

For the second step, apply Equation (5.2-11) with $m_1 = 1$, $P_1 = 1/2$, $C_2 = 1$,
$D_2 = 0$, $G_2 = 1$, and $Z_2 = -1$.

$$\hat{\xi} = 1 + \tfrac{1}{2}\left(\tfrac{3}{2}\right)^{-1}(Z_2 - 1) = \tfrac{1}{3} Z_2 + \tfrac{2}{3} = \tfrac{1}{3}$$

We see that the results of the two approaches are identical in this example,
as claimed. Note that the partitioning removes the requirement to invert a
2-by-2 matrix, substituting two 1-by-1 inversions.
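
The arithmetic of this example can be reproduced in a few lines. The sketch below assumes a helper named
`map_update` (a name introduced here, not from the text) that implements the covariance-form posterior mean
and covariance of Equations (5.1-12) and (5.1-13).

```python
import numpy as np

def map_update(m, P, C, D, G, Z):
    """Covariance-form posterior mean and covariance for Z = C xi + D + G omega."""
    S = C @ P @ C.T + G @ G.T
    K = P @ C.T @ np.linalg.inv(S)
    return m + K @ (Z - C @ m - D), P - K @ C @ P

m0, P0 = np.zeros(1), np.eye(1)          # prior: mean 0, covariance 1

# Unpartitioned: both measurements at once (inverts a 2-by-2 matrix).
C = np.array([[1.0], [1.0]])
m_all, P_all = map_update(m0, P0, C, np.zeros(2), np.eye(2), np.array([2.0, -1.0]))

# Partitioned: Z1 = 2 first, then Z2 = -1 with (m1, P1) as the prior.
one = np.eye(1)
m1, P1 = map_update(m0, P0, one, np.zeros(1), one, np.array([2.0]))
m2, P2 = map_update(m1, P1, one, np.zeros(1), one, np.array([-1.0]))

print(m_all, m1, P1, m2)
```

The intermediate values `m1` and `P1` match the hand computation (1 and 1/2), and the final partitioned
estimate `m2` agrees with the unpartitioned estimate of 1/3.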


The computational advantages of using the partitioned form of the MAP estimator vary depending on
numerous factors. There are numerous other rearrangements of Equations (5.1-12) and (5.1-13). The information
form of Equations (5.1-14) and (5.1-15) is often preferable if the required inverses exist. The information
form can also be used in the partitioned estimator, replacing Equations (5.2-9) through (5.2-11) with
corresponding information forms. Equation (5.1-30) is another alternative, which is often the most efficient.

There is at least one circumstance in which a partitioned form is mandatory. This is when the data
comes in two separate batches and the first batch of data must be discarded (for any of several reasons,
perhaps unavailability of enough computer memory) before processing the second batch. Such circumstances occur
regularly. Partitioned estimators are also particularly appropriate when you have already computed the
estimate based on the first batch of data before receiving the second batch.
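
A batch-at-a-time estimator of this kind might be sketched as follows. The data, batch sizes, and the helper
name `map_update` are assumptions made for illustration; the batches are stored in the list `kept` only so
that the all-at-once reference can be computed for comparison, and a real application would discard them.

```python
import numpy as np

def map_update(m, P, C, D, G, Z):
    """Covariance-form posterior mean and covariance, used once per batch."""
    S = C @ P @ C.T + G @ G.T
    K = P @ C.T @ np.linalg.inv(S)
    return m + K @ (Z - C @ m - D), P - K @ C @ P

rng = np.random.default_rng(3)
xi_true = np.array([1.0, -2.0])           # hypothetical true parameter
m_prior, P_prior = np.zeros(2), 4.0 * np.eye(2)
m, P = m_prior.copy(), P_prior.copy()

kept = []                                 # retained only for the reference computation
for _ in range(3):
    # Each batch has its own independent noise, so the Markov criterion holds.
    C = rng.normal(size=(4, 2))
    G = 0.3 * np.eye(4)
    Z = C @ xi_true + G @ rng.normal(size=4)
    m, P = map_update(m, P, C, np.zeros(4), G, Z)   # posterior becomes the next prior
    kept.append((C, Z))

# Reference: process all twelve measurements in one step.
C_all = np.vstack([c for c, _ in kept])
Z_all = np.concatenate([z for _, z in kept])
m_ref, P_ref = map_update(m_prior, P_prior, C_all, np.zeros(12), 0.3 * np.eye(12), Z_all)

print(np.allclose(m, m_ref), np.allclose(P, P_ref))
```

Only the running mean and covariance are carried from one batch to the next, which is what makes the
partitioned form usable when earlier data cannot be retained.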

Let us now consider the partitioned MLE estimator. The first step is to compute $p(Z_1 \mid \xi)$.
Equation (5.1-38) gives a formula for $p(Z \mid \xi)$. It is immediately evident that the logarithm of
$p(Z_1 \mid \xi)$ is a quadratic form in $\xi$. Therefore, although $p(Z_1 \mid \xi)$ need not be interpreted
as a probability density function of $\xi$, it has the algebraic form of a Gaussian density function, except
for an irrelevant constant multiplier. Applying Equations (3.5-17) and (3.5-18) gives the mean and covariance
of this function as

$$m_1 = P_1 C_1^* (G_1 G_1^*)^{-1} (Z_1 - D_1) \tag{5.2-12}$$

$$P_1 = -[\nabla_\xi^2 \ln p(Z_1 \mid \xi)]^{-1} = [C_1^* (G_1 G_1^*)^{-1} C_1]^{-1} \tag{5.2-13}$$


The second step of the partitioned MLE estimator is identical to the second step of the partitioned MAP
estimator. Apply Equation (5.2-11), using the $m_1$ and $P_1$ from the first step. For the partitioned MLE
estimator, it is most natural (although not required) to use the information form of Equation (5.2-11),
which is

$$\hat{\xi} = m_1 + P_2 C_2^* (G_2 G_2^*)^{-1} (Z_2 - C_2 m_1 - D_2) \tag{5.2-14}$$

$$P_2 = [C_2^* (G_2 G_2^*)^{-1} C_2 + P_1^{-1}]^{-1} \tag{5.2-15}$$

This form is more parallel to Equations (5.2-12) and (5.2-13).

Example 5.2-2  Consider a maximum likelihood estimator for the problem of
Example 5.2-1, ignoring the prior distribution of $\xi$. To get the MLE
estimate for the concatenated system, apply Equation (5.1-40) with
$C = [1\ 1]^*$, $D = 0$, $G = I$, and $Z = [2, -1]^*$.

$$\hat{\xi} = (2)^{-1} [1\ 1] Z = \tfrac{1}{2}(Z_1 + Z_2) = \tfrac{1}{2}$$

Now consider the same problem with measurement partitioning. For the first
step, apply Equations (5.2-12) and (5.2-13) with $C_1 = 1$, $D_1 = 0$, $G_1 = 1$,
and $Z_1 = 2$.

$$P_1 = [1(1)^{-1}1]^{-1} = 1$$

$$m_1 = P_1 (1)(1)^{-1}(Z_1 - 0) = Z_1 = 2$$

For the second step, apply Equations (5.2-14) and (5.2-15) with $C_2 = 1$,
$D_2 = 0$, $G_2 = 1$, and $Z_2 = -1$.

$$P_2 = [1(1)^{-1}1 + (1)^{-1}]^{-1} = \tfrac{1}{2}$$

$$\hat{\xi} = 2 + \tfrac{1}{2}(1)^{-1}(Z_2 - 2 - 0) = 1 + \tfrac{1}{2} Z_2 = \tfrac{1}{2}$$

The partitioned algorithm thus gives the same result as the original
unpartitioned algorithm.
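
The same numbers can be obtained programmatically. In this sketch, `mle_first_step` and `information_update`
are hypothetical helper names for Equations (5.2-12) and (5.2-13) and for Equations (5.2-14) and (5.2-15),
respectively.

```python
import numpy as np

def mle_first_step(C, D, G, Z):
    """'Mean' and 'covariance' of p(Z1 | xi) regarded as a function of xi."""
    R_inv = np.linalg.inv(G @ G.T)
    P1 = np.linalg.inv(C.T @ R_inv @ C)
    return P1 @ C.T @ R_inv @ (Z - D), P1

def information_update(m1, P1, C, D, G, Z):
    """Information-form second step: combine (m1, P1) with a new measurement."""
    R_inv = np.linalg.inv(G @ G.T)
    P2 = np.linalg.inv(C.T @ R_inv @ C + np.linalg.inv(P1))
    return m1 + P2 @ C.T @ R_inv @ (Z - C @ m1 - D), P2

one, zero = np.eye(1), np.zeros(1)

# Partitioned MLE: Z1 = 2 first, then Z2 = -1.
m1, P1 = mle_first_step(one, zero, one, np.array([2.0]))
xi_hat, P2 = information_update(m1, P1, one, zero, one, np.array([-1.0]))

# Unpartitioned MLE for comparison (G = I, D = 0, so this is ordinary least squares).
C = np.array([[1.0], [1.0]])
xi_all = np.linalg.solve(C.T @ C, C.T @ np.array([2.0, -1.0]))

print(m1, P1, P2, xi_hat, xi_all)
```

The first step yields `m1 = 2` and `P1 = 1`, the second yields `P2 = 1/2`, and both routes give the estimate
1/2, the average of the two measurements.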

There is often confusion on the issue of the bias of the partitioned MLE estimator. This is an MLE
estimate of $\xi$ based on both $Z_1$ and $Z_2$. It is, therefore, unbiased like all MLE estimators for linear
systems with additive Gaussian noise. On the other hand, the last step of the partitioned estimator is an MAP
estimate based on $Z_2$, with a prior distribution described by $m_1$ and $P_1$. We have previously shown that
MAP estimators are biased. There is no contradiction in these two viewpoints. The estimate is biased based on
the measurement $Z_2$ alone, but unbiased based on $Z_1$ and $Z_2$.

Therefore, it is overly simplistic to universally condemn MAP estimators as biased. The bias is not
always so clear an issue, but requires you to define exactly on what data you are basing the bias definition.
The primary basis for deciding whether to use an MAP or MLE estimator is whether you want estimates based only
on the current set of data, or estimates based on the current data and prior information combined. The bias
merely reflects this decision; it does not give you independent help in deciding.

5.2.3  Parameter  Partitioning 

In parameter partitioning, we write the parameter vector $\xi$ as a function of two (or more; the
generalizations are obvious) smaller vectors $\xi_1$ and $\xi_2$.

$$\xi = f(\xi_1, \xi_2) \tag{5.2-16}$$

The function $f$ must be invertible to obtain $\xi_1$ and $\xi_2$ from $\xi$, or the solution to the
partitioned problem will not be unique. The simplest kind of partitions are those in which $\xi_1$ and $\xi_2$
are partitions of the $\xi$ vector.

With the parameter $\xi$ partitioned into $\xi_1$ and $\xi_2$, we have a partitioned optimization problem.
Two possible solution methods apply. The best method, if it can be used, is generally to solve for $\xi_2$ in
terms of $\xi_1$ (or vice versa) and substitute this relationship into the original problem. Axial iteration
is another reasonable method if solutions for $\xi_1$ and $\xi_2$ are nearly independent so that few
iterations are required.
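
As an illustration of axial iteration on a partitioned parameter vector, consider the following sketch. It
assumes a simple least-squares cost $\|Z - C_1\xi_1 - C_2\xi_2\|^2$ with weakly coupled partitions; the
matrices are constructed for illustration and `solve_block` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(4)

# Build two partitions whose columns are nearly orthogonal (weak coupling).
Q, _ = np.linalg.qr(rng.normal(size=(20, 4)))
C1 = Q[:, :2]
C2 = Q[:, 2:] + 0.1 * Q[:, :2]
Z = rng.normal(size=20)

def solve_block(C, residual):
    """Least-squares solution for one partition with the other held fixed."""
    return np.linalg.lstsq(C, residual, rcond=None)[0]

xi1, xi2 = np.zeros(2), np.zeros(2)
iterations = 0
for iterations in range(1, 51):
    xi1_new = solve_block(C1, Z - C2 @ xi2)        # optimize xi1 with xi2 fixed
    xi2_new = solve_block(C2, Z - C1 @ xi1_new)    # optimize xi2 with xi1 fixed
    step = max(np.abs(xi1_new - xi1).max(), np.abs(xi2_new - xi2).max())
    xi1, xi2 = xi1_new, xi2_new
    if step < 1e-12:
        break

# Reference: solve the unpartitioned problem for the full parameter vector.
xi_joint = np.linalg.lstsq(np.hstack([C1, C2]), Z, rcond=None)[0]
print(iterations, np.allclose(np.concatenate([xi1, xi2]), xi_joint))
```

Because the coupling between the partitions is weak here, the iteration settles in a handful of passes. With
strongly coupled partitions the same loop would need many more passes, which is why solving for one partition
in terms of the other is generally preferred when it is possible.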


5.3  LIMITING  CASES  AND  SINGULARITIES 

In the previous discussions, we have simply assumed that all of the required matrix inverses exist. We
made this assumption to present some of the basic results without getting sidetracked on fine points. We will
now take a comprehensive look at all of the singularities and limiting cases, explaining both the circumstances
that give rise to the various special cases, and how to handle such cases when they occur.

The reader will recognize that most of the special cases are idealizations which are seldom literally
true. We almost never know any value perfectly (zero covariance). Conversely, it is rare to have absolutely
no information about the value of a parameter (infinite covariance). There are very few parameters that would
not be viewed with suspicion if an estimate of, say, 10,l( were obtained. These idealizations are useful in
practice for two reasons. First, they avoid the necessity to quantify statements such as "virtually perfect"
when the difference between virtually perfect and perfect is not of measurable consequence (although one must
be careful: sometimes even an extremely small difference can be crucial). Second, numerical problems with
finite arithmetic can be alleviated by recognizing essentially singular situations and treating them specially
as though they were exactly singular.

We will address two kinds of singularities. The first kind of singularity involves Gaussian distributions
with singular covariance matrices. These are perfectly valid probability distributions conforming to the usual
definition. The distributions, however, do not have density functions; therefore the maximum a posteriori
probability and maximum likelihood estimates cannot be defined as we have done. The singularity implies that
the probability distribution is entirely concentrated on a subspace of the originally defined probability
space. If the problem statement is redefined to include only the subspace, the restricted problem is
nonsingular. You can also address this singularity by looking at limits as the covariance approaches the
singular matrix, provided that the limits exist.

The second kind of singularity involves Gaussian variables with infinite covariance. Conceptually, the
meaning of infinite covariance is easily stated: we have no information about the value of the variable (but
we must be careful about generalizing this idea, particularly in nonlinear transformations; see the discussion
at the end of Section 4.3.4). Unluckily, infinite covariance Gaussians do not fit within the strict
definition of a probability distribution. (They cannot meet axiom 2 in Section 3.1.1.) For current purposes, we
need only recognize that an infinite covariance Gaussian distribution can be considered as a limiting case (in
some sense that we will not precisely define here) of finite covariance Gaussians. The term "generalized
probability distribution" is sometimes used in connection with such limiting arguments. The equations which
apply to the infinite covariance case are the limits of the corresponding finite covariance cases, provided
that the limits exist. The primary concern in practice is thus how to compute the appropriate limits.

We could avoid several of the singularities by retreating to a higher level of abstraction in the
mathematics. The theory can consistently treat Gaussian variables with singular covariances by replacing the
concept of a probability density function with the more general concept of a Radon-Nikodym derivative. (A
probability density function is a specific case of a Radon-Nikodym derivative.) Although such variables do
not have probability density functions, they do have Radon-Nikodym derivatives with respect to appropriate
measures. Substituting the more general and more abstract concept of σ-finite measures in place of
probability measures allows strict definition of infinite covariance Gaussian variables within the same
context.

This level of abstraction requires considerable depth of mathematical background, but changes little in
the practical application. We can derive the identical computational methods at a lower level of abstraction.
The abstract theory serves to place all of the theoretical results in a common framework. In many senses the
general abstract theory is simpler than the more concrete approach; there are fewer exceptions and special
cases to consider. In implementing the abstract theory, the same computational issues arise, but the
simplified viewpoint can help indicate how to resolve these issues. Simply knowing that the problem does have
a well-defined solution is a major aid to finding the solution.

The conceptual simplification gained by the abstract theory requires significantly more background than
we assume in this book. Our emphasis will be on the computations required to deal with the singularities,
rather than on the abstract theory. Royden (1968), Rudin (1974), and Liptser and Shiryayev (1977) treat such
subjects as σ-finite measures and Radon-Nikodym derivatives.

We will consider two general computational methods for treating singularities. The first method is to
use alternate forms of the equations which are not affected by the singularity. The covariance form
(Equations (5.1-12) and (5.1-13)) and the information form (Equations (5.1-14) and (5.1-15)) of the posterior
distribution are equivalent, but have different points of singularity. Therefore, a singularity in one form
can often be handled simply by switching to the other form. This simple method fails if a problem statement
has singularities in both forms. Also, we may desire to stick with a particular form for other reasons.
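
The different points of singularity of the two forms can be seen in a short sketch. The data below are
hypothetical, and `covariance_form` and `information_form` are names introduced here for the two equivalent
expressions of the posterior mean and covariance.

```python
import numpy as np

def covariance_form(m, P, C, D, G, Z):
    """Requires only that CPC* + GG* be invertible."""
    S = C @ P @ C.T + G @ G.T
    K = P @ C.T @ np.linalg.inv(S)
    return m + K @ (Z - C @ m - D), P - K @ C @ P

def information_form(m, P, C, D, G, Z):
    """Requires that both P and GG* be invertible."""
    R_inv = np.linalg.inv(G @ G.T)
    P_post = np.linalg.inv(C.T @ R_inv @ C + np.linalg.inv(P))
    return m + P_post @ C.T @ R_inv @ (Z - C @ m - D), P_post

# Hypothetical problem data, chosen only for illustration.
C = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
D, G = np.zeros(3), 0.5 * np.eye(3)
Z = np.array([1.0, 2.5, 3.0])
m = np.zeros(2)

# Nonsingular P: the two forms agree.
P = np.diag([2.0, 3.0])
agree = np.allclose(covariance_form(m, P, C, D, G, Z)[0],
                    information_form(m, P, C, D, G, Z)[0])

# Singular P (second parameter known exactly): the covariance form still applies,
# whereas the information form would require the nonexistent inverse of P.
P_sing = np.diag([2.0, 0.0])
m_post, P_post = covariance_form(m, P_sing, C, D, G, Z)
print(agree, m_post, P_post)
```

In the singular case the perfectly known parameter keeps its prior value and zero variance after the update,
and only the uncertain parameter is adjusted by the measurements.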

The second method is to partition the estimation problem into two parts: the totally singular part and
the nonsingular part. This partitioning allows us to use one means of solving the singular part and another
means of solving the nonsingular part; we then combine the partial solutions to give the final result.

5.3.1  Singular  P

The first case that we will consider is singular $P$. A singular $P$ matrix indicates that some parameter
or linear combination of parameters is known perfectly before the experiment is performed. For instance, we
might know that $\xi_1 = 5\xi_2 + 3$, even though $\xi_1$ and $\xi_2$ are unknown. In this case, we know the
linear combination $\xi_1 - 5\xi_2$ exactly. The singular $P$ matrix creates no problems if we use the
covariance form instead of the information form. If we specifically desire to use the information form, we can
handle the singularity as follows.

Since $P$ is always symmetric, the range and the null space of $P$ form an orthogonal decomposition of the
space $\Xi$. The singular eigenvectors of $P$ span the null space, and the nonsingular eigenvectors span the
range. Use the eigenvectors to decompose the parameter estimation problem into the totally singular subproblem
and the totally nonsingular subproblem. This is a parameter partitioning as discussed in Section 5.2. The
totally singular subproblem is trivial because we know the exact solution when we start (by definition).
Substitute the solution of the singular problem in the original problem and solve the nonsingular subproblem
in the normal manner.

A specific implementation of this decomposition is as follows: let $X_S$ be the matrix of orthonormal
singular eigenvectors of $P$, and $X_{NS}$ be the matrix of orthonormal nonsingular eigenvectors. Then define

$$\xi_S = X_S^* \xi \tag{5.3-1a}$$

$$\xi_{NS} = X_{NS}^* \xi \tag{5.3-1b}$$

The covariances of $\xi_S$ and $\xi_{NS}$ are

$$\operatorname{cov}(\xi_S) = X_S^* P X_S = 0 \tag{5.3-2a}$$

$$\operatorname{cov}(\xi_{NS}) = X_{NS}^* P X_{NS} = P_{NS} \tag{5.3-2b}$$

where $P_{NS}$ is nonsingular. Write

$$\xi = X_{NS} \xi_{NS} + X_S \xi_S \tag{5.3-3}$$

Substitute Equation (5.3-3) into the original problem. Use the exactly known value of $\xi_S$, and restate the
problem in terms of $\xi_{NS}$ as the unknown parameter vector. Other decompositions derived from multiplying
Equation (5.3-1) by nonsingular transformations can be used if they have advantages for specific situations.

We will henceforth assume that $P$ is nonsingular. It is unimportant whether the original problem
statement is nonsingular or we are working with the nonsingular subproblem.

The implementation above is defined in very general terms, which would allow it to be done as an
automatic computer subroutine. In practice, we usually know the fact of and reason for the singularity
beforehand and can easily handle it more concretely. If an equation gives an exact relationship between two or
more variables which we know prior to the experiment, we solve the equation for one variable and remove that
variable from the problem by substitution.
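
The general eigenvector decomposition might be sketched as follows for a prior in which
$\xi_1 = 5\xi_2 + 3$ is known exactly. The measurement data, the prior variance along the uncertain direction,
the eigenvalue threshold, and the names `C_red` and `D_red` are all assumptions made for illustration.

```python
import numpy as np

# Prior knowledge xi_1 = 5 xi_2 + 3 exactly; uncertainty only along the direction (5, 1).
v = np.array([5.0, 1.0]) / np.sqrt(26.0)
P = 4.0 * np.outer(v, v)                  # singular prior covariance
m = np.array([3.0, 0.0])                  # prior mean, satisfies the constraint

# Hypothetical measurement system Z = C xi + D + G omega.
C = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D, G = np.zeros(3), np.eye(3)
Z = np.array([7.5, 1.2, 9.0])

# Split the orthonormal eigenvectors of P into singular and nonsingular sets.
eigvals, eigvecs = np.linalg.eigh(P)
singular = eigvals < 1e-10 * eigvals.max()
X_S, X_NS = eigvecs[:, singular], eigvecs[:, ~singular]

xi_S = X_S.T @ m                          # known exactly before the experiment
P_NS = X_NS.T @ P @ X_NS                  # nonsingular covariance of xi_NS
m_NS = X_NS.T @ m

# Reduced problem in xi_NS after substituting xi = X_NS xi_NS + X_S xi_S.
C_red = C @ X_NS
D_red = D + C @ X_S @ xi_S
R_inv = np.linalg.inv(G @ G.T)
P_post = np.linalg.inv(C_red.T @ R_inv @ C_red + np.linalg.inv(P_NS))
xi_NS = m_NS + P_post @ C_red.T @ R_inv @ (Z - C_red @ m_NS - D_red)

xi_hat = X_NS @ xi_NS + X_S @ xi_S        # recombine the two parts

# Reference: covariance form applied directly to the original singular-P problem.
K = P @ C.T @ np.linalg.inv(C @ P @ C.T + G @ G.T)
xi_ref = m + K @ (Z - C @ m - D)
print(xi_hat, xi_hat[0] - 5.0 * xi_hat[1])
```

The information form is applied only to the nonsingular subproblem, and the recombined estimate both matches
the covariance-form result and satisfies the exactly known relationship.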

Example 5.3-1  Assume that the output of a system is a known function of the
applied force and moment

$$Z = f(F, M)$$

An unknown point force is applied at a known position $r$ referred to the
origin. We thus know that

$$M = r \times F$$

If $F$ and $M$ are both considered as unknowns, the $P$ matrix is singular.
But this singularity is readily removed by substituting for $M$ in terms of
$F$ so that $F$ is the only unknown.

$$Z = f(F, r \times F) = f_1(F)$$


5.3.2 Singular GG*

The treatment of singular GG* is similar in principle to that of singular P. A singular GG* matrix
implies that some measurement or combination of measurements is made perfectly (i.e., noise-free). The
covariance form does not involve the inverse of GG*, and thus can be used with no difficulty when GG* is
singular.

An alternate approach involves a sequential decomposition of the original problem into totally singular
(GG* = 0) and nonsingular subproblems. The totally singular subproblem must be handled in the covariance form;
the nonsingular subproblem can then be handled in either form. This is a measurement partitioning as
described in Section 5.2. Divide the measurement into two portions, called the singular and the nonsingular
measurements, $Z_S$ and $Z_{NS}$. First ignore $Z_S$ and find the posterior distribution of $\xi$ given only $Z_{NS}$. Then
use this result as the distribution prior to $Z_S$. We specifically implement this decomposition as follows:

For the first step of the decomposition, let $X_{NS}$ be the matrix of nonsingular eigenvectors of GG*.
Multiply Equation (5.1-1) on the left by $X_{NS}^*$, giving

$$X_{NS}^* Z = X_{NS}^* C\xi + X_{NS}^* D + X_{NS}^* G\omega \tag{5.3-4}$$

Define

$$Z_{NS} = X_{NS}^* Z, \qquad C_{NS} = X_{NS}^* C, \qquad D_{NS} = X_{NS}^* D, \qquad G_{NS} = X_{NS}^* G \tag{5.3-5}$$

Equation (5.3-4) then becomes

$$Z_{NS} = C_{NS}\xi + D_{NS} + G_{NS}\omega \tag{5.3-6}$$

Note that $G_{NS}G_{NS}^*$ is nonsingular. Using the information form for the posterior distribution, the distribution
of $\xi$ conditioned on $Z_{NS}$ is

$$m_{NS} = E\{\xi|Z_{NS}\} = m_\xi + \left(C_{NS}^*(G_{NS}G_{NS}^*)^{-1}C_{NS} + P^{-1}\right)^{-1} C_{NS}^*(G_{NS}G_{NS}^*)^{-1}\left(Z_{NS} - C_{NS}m_\xi - D_{NS}\right) \tag{5.3-7a}$$

$$P_{NS} = \operatorname{cov}\{\xi|Z_{NS}\} = \left(C_{NS}^*(G_{NS}G_{NS}^*)^{-1}C_{NS} + P^{-1}\right)^{-1} \tag{5.3-7b}$$


For the second step, let $X_S$ be the matrix of singular eigenvectors of GG*. Corresponding to
Equation (5.3-6) is

$$Z_S = C_S\xi + D_S + G_S\omega \tag{5.3-8}$$

where

$$Z_S = X_S^* Z, \qquad C_S = X_S^* C, \qquad D_S = X_S^* D, \qquad G_S = X_S^* G = 0 \tag{5.3-9}$$

Use Equation (5.3-7) for the prior distribution for this step. Since $G_S$ is 0, we must use the covariance
form for the posterior distribution, which reduces to

$$E\{\xi|Z\} = m_{NS} + P_{NS}C_S^*(C_S P_{NS} C_S^*)^{-1}(Z_S - C_S m_{NS} - D_S) \tag{5.3-10a}$$

$$\operatorname{cov}\{\xi|Z\} = P_{NS} - P_{NS}C_S^*(C_S P_{NS} C_S^*)^{-1}C_S P_{NS} \tag{5.3-10b}$$


Equations (5.3-4), (5.3-6), (5.3-8), and (5.3-10) give an alternate expression for the posterior distribution
of $\xi$ given Z which we can use when GG* is singular. It does require that $C_S P_{NS} C_S^*$ be nonsingular.

This is a special case of the requirement that CPC* + GG* be nonsingular, which we discuss later. It is
interesting to note that the covariance (Equation (5.3-10b)) of the estimate is singular. Multiply
Equation (5.3-10b) on the right by $C_S^*$ and obtain

$$P_{NS}C_S^* - P_{NS}C_S^*(C_S P_{NS} C_S^*)^{-1}C_S P_{NS} C_S^* = P_{NS}C_S^* - P_{NS}C_S^* = 0 \tag{5.3-11}$$

Therefore the columns of $C_S^*$ are all singular eigenvectors of the covariance of the estimate.
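The two-step procedure of Equations (5.3-4) to (5.3-10) can be sketched as below and compared with the covariance form applied to all measurements at once. This is an illustrative sketch under stated assumptions: real matrices (so * is a transpose), at least one singular and one nonsingular eigenvalue of GG*, and a nonsingular P. The function name `posterior_two_step`, the tolerance, and the numerical values are hypothetical.

```python
import numpy as np

def posterior_two_step(C, D, G, m_xi, P, Z, tol=1e-10):
    """Posterior mean and covariance of xi when GG* is singular (assumes real matrices)."""
    w, X = np.linalg.eigh(G @ G.T)
    sing = w <= tol * max(w.max(), 1.0)
    X_S, X_NS = X[:, sing], X[:, ~sing]

    # Step 1: nonsingular measurements, information form, Equations (5.3-5) and (5.3-7).
    Z_NS, C_NS, D_NS, G_NS = X_NS.T @ Z, X_NS.T @ C, X_NS.T @ D, X_NS.T @ G
    W = np.linalg.inv(G_NS @ G_NS.T)
    P_NS = np.linalg.inv(C_NS.T @ W @ C_NS + np.linalg.inv(P))
    m_NS = m_xi + P_NS @ C_NS.T @ W @ (Z_NS - C_NS @ m_xi - D_NS)

    # Step 2: noise-free measurements, covariance form, Equations (5.3-9) and (5.3-10).
    Z_S, C_S, D_S = X_S.T @ Z, X_S.T @ C, X_S.T @ D
    K = P_NS @ C_S.T @ np.linalg.inv(C_S @ P_NS @ C_S.T)
    mean = m_NS + K @ (Z_S - C_S @ m_NS - D_S)
    cov = P_NS - K @ C_S @ P_NS
    return mean, cov, C_S

# Hypothetical problem: the third measurement is noise-free.
C = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
D = np.zeros(3)
G = np.diag([1.0, 0.5, 0.0])
m_xi = np.zeros(2)
P = np.eye(2)
Z = np.array([1.0, 2.0, 0.5])

mean, cov, C_S = posterior_two_step(C, D, G, m_xi, P, Z)
print("posterior mean:", mean)
print("cov @ C_S*:", cov @ C_S.T)   # numerically zero, as in Equation (5.3-11)
```

In this example the posterior covariance has a zero eigenvalue along the direction measured without noise, which is the singularity noted after Equation (5.3-10b).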


5.3.3  Singular  CPC*  +  GG* 


The next special case that we will consider is when CPC* + GG* is singular. Note first that this can
happen only when GG* is also singular, because CPC* and GG* are both positive semi-definite, and the sum
of two such matrices can be singular only if both terms are singular. Since both GG* and CPC* + GG* are
singular, neither the covariance form nor the information form circumvents the singularity. In fact, there is
no way to circumvent this singularity. If CPC* + GG* is singular, the problem is intrinsically ill-posed.
The only solution is to restate the original problem.


If we examine what is implied by a singular CPC* + GG*, we will be able to see why it necessarily means
that the problem is ill-posed, and what kinds of changes in the problem statement are required. Referring to
Equation (5.1-8), we see that CPC* + GG* is the covariance of the measurement Z. GG* is the contribution
of the measurement noise to this covariance, and CPC* is the contribution due to the prior variance of $\xi$.
If CPC* + GG* is singular, we can exactly predict some part of the measured response. For this to occur,
there must be neither measurement noise nor parameter uncertainty affecting that particular part of the
response.

Clearly, there are serious mathematical difficulties in saying that we know exactly what the measured
value will be before taking the measurement. At best, the measurement can agree with what we predicted, which
adds no new information. If, however, there is any disagreement at all, even due to rounding error in the
computations, there is an irresolvable contradiction: we said that we knew exactly what the value would be and
we were wrong. This is one situation where the difference between almost perfect and perfect is extremely
important. As CPC* + GG* approaches singularity, the corresponding estimators diverge; we cannot talk about
the limiting case because the estimators do not converge to a limit in any meaningful sense.

5.3.4  Infinite  P 

Up to this point, the special cases considered have all involved singular covariance matrices,
corresponding to perfectly known quantities. The remaining special cases all concern limits as eigenvalues of a
covariance matrix approach infinity, corresponding to total ignorance of the value of a quantity.

The first such special case to discuss is when an eigenvalue of P approaches infinity. The problem is
much easier to discuss in terms of the information matrix $P^{-1}$. As an eigenvalue of P approaches infinity,
the corresponding eigenvalue of $P^{-1}$ approaches zero. At the limit, $P^{-1}$ is singular. To be cautious, we
should not speak of $P^{-1}$ being singular but only of the limit as $P^{-1}$ goes to a singularity, as it is not
meaningful to say that $P^{-1}$ is singular. Provided that we use the information form everywhere, all of the
limits as $P^{-1}$ goes to a singularity are well-behaved and can be evaluated simply by substituting the singular
value for $P^{-1}$. Thus this singularity poses no difficulties in practice, as long as we avoid the use of
expressions involving a noninverted P. As previously mentioned, the limit as $P^{-1}$ goes to zero is
particularly interesting and results in estimates identical to the maximum likelihood estimates. Using a singular
$P^{-1}$ is tantamount to saying that there is no prior information about some parameter or set of parameters (or
that we choose to discount any such information in order to obtain an independent check). There is no
convenient way to decompose the problem so that the covariance form can be used with singular $P^{-1}$ matrices.
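As a small numerical illustration of substituting a singular $P^{-1}$ into the information form, the sketch below sets $P^{-1} = 0$ and recovers an estimate that no longer depends on the prior mean. It assumes real matrices; the function name `information_form` and the data are hypothetical, chosen only for the example.

```python
import numpy as np

def information_form(C, D, GG_inv, m_xi, P_inv, Z):
    """Posterior mean and covariance written only in terms of P^-1 and (GG*)^-1."""
    info = C.T @ GG_inv @ C + P_inv
    cov = np.linalg.inv(info)
    mean = m_xi + cov @ C.T @ GG_inv @ (Z - C @ m_xi - D)
    return mean, cov

# Hypothetical straight-line fit with three measurements and unit noise covariance.
C = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
D = np.zeros(3)
Z = np.array([1.0, 2.0, 2.0])
GG_inv = np.eye(3)
P_inv = np.zeros((2, 2))          # limit of no prior information

mean_a, cov_a = information_form(C, D, GG_inv, np.zeros(2), P_inv, Z)
mean_b, cov_b = information_form(C, D, GG_inv, np.array([5.0, -3.0]), P_inv, Z)
print(mean_a, mean_b)             # the two results agree: the prior mean has no influence
```

With unit noise covariance the result coincides with the ordinary least-squares solution, which is the maximum likelihood estimate for this linear model.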

The meaning of a singular $P^{-1}$ is most clearly illustrated by some examples using confidence regions. A
confidence region is the area where the probability density function (really a generalized probability density
function here) is greater than or equal to some given constant. (See Chapter 11 for a more detailed discussion
of confidence regions.) Let the parameter vector consist of two elements, $\xi_1$ and $\xi_2$. Assume that the prior
distribution has mean zero and

$$P^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$$

The prior confidence regions are given by

$$p(\xi) \ge C_1$$

or equivalently

$$\xi^* \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \xi \le C_2$$

which reduces to

$$\xi_1^2 \le C_2$$

where $C_1$ and $C_2$ are constants depending on the level of confidence desired. For current purposes, we are
interested only in the shape of the confidence region, which is independent of the values of the constants.
Figure (5.3-1) is a sketch of the shape. Note that this confidence region is a limiting case of an ellipse
with major axis length going to infinity while the minor axis is fixed. This prior distribution gives
information about $\xi_1$, but none about $\xi_2$.


Now consider a second example, which is identical to the first except that


In this case, the prior confidence region is




5.3.5 Infinite GG*

Corresponding to the case where $P^{-1}$ approaches a singular point is the similar case where $(GG^*)^{-1}$
approaches a singularity. As in the case of singular $P^{-1}$, there are no computational problems. We can
readily evaluate all of the limits simply by substituting the singular matrix for $(GG^*)^{-1}$. The information
form avoids the use of a noninverted GG*. A singular $(GG^*)^{-1}$ matrix would indicate that some measurement or
linear combination of measurements had infinite noise variance, which is rather unlikely. The primary use of
singular $(GG^*)^{-1}$ matrices in practice is to make the estimator ignore certain measurements if they are
worthless or simply unavailable. It is mathematically cleaner to rewrite the system model so that the unused
measurements are not included in the observation vector, but it is sometimes more convenient to simply use a
singular $(GG^*)^{-1}$ matrix. The two methods give the same result. (Not having a measurement at all is
equivalent to having one and ignoring it.) One interesting specific case occurs when $(GG^*)^{-1}$ approaches 0. This
method then amounts to ignoring all of the measurements. As might be expected, the a posteriori estimate is
then the same as the a priori estimate.
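The equivalence between ignoring a measurement through a singular $(GG^*)^{-1}$ and deleting it from the observation vector can be checked with a short sketch. The function name `information_form` and all numerical values are hypothetical, and real matrices are assumed.

```python
import numpy as np

def information_form(C, D, GG_inv, m_xi, P_inv, Z):
    """Posterior mean and covariance in the information form (real matrices assumed)."""
    cov = np.linalg.inv(C.T @ GG_inv @ C + P_inv)
    mean = m_xi + cov @ C.T @ GG_inv @ (Z - C @ m_xi - D)
    return mean, cov

C = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D = np.zeros(3)
Z = np.array([1.0, 2.0, 10.0])            # the third value is treated as worthless
m_xi = np.array([0.5, -1.0])
P_inv = np.eye(2)

# (a) Keep all three rows, but give the third measurement zero weight.
W_ignore = np.diag([1.0, 1.0, 0.0])
mean_a, cov_a = information_form(C, D, W_ignore, m_xi, P_inv, Z)

# (b) Rewrite the model without the third measurement.
mean_b, cov_b = information_form(C[:2], D[:2], np.eye(2), m_xi, P_inv, Z[:2])

# (c) (GG*)^-1 = 0: every measurement is ignored and the prior is returned unchanged.
mean_c, cov_c = information_form(C, D, np.zeros((3, 3)), m_xi, P_inv, Z)
print(mean_a, mean_b, mean_c)
```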


5.3.6 Singular $C^*(GG^*)^{-1}C + P^{-1}$


The final special case to be discussed is when the $C^*(GG^*)^{-1}C + P^{-1}$ in the information form approaches
a singular value. Note that this can occur only if $P^{-1}$ is also approaching a singularity. Therefore, the
problem cannot be avoided by using the covariance form. If $C^*(GG^*)^{-1}C + P^{-1}$ is singular, it means that there
is no prior information about a parameter or combination of parameters, and that the experiment added no such
information. The difficulty, then, is that there is absolutely no basis for estimating the value of the
singular parameter or combination. The system is referred to as being unidentifiable when this singularity is
present. Identifiability is an important issue in the theory of parameter estimation. The easiest
computational solution is to restate the problem, deleting the parameter in question from the list of unknowns.
Essentially the same result comes from using a pseudo-inverse in Equation (5.1-14) (but see the discussion in
Section 2.4.3 on the blind use of pseudo-inverses to "solve" such problems). Of course, the best alternative
is often to examine why the experiment gave no information about the parameter, and to redesign the experiment
so that a usable estimate can be obtained.
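One way to detect this singularity numerically is to check the rank of $C^*(GG^*)^{-1}C + P^{-1}$. The sketch below is illustrative only; the helper names and the example, in which the second column of C is twice the first so that only the combination $\xi_1 + 2\xi_2$ affects the measurements, are hypothetical.

```python
import numpy as np

def information_matrix(C, GG_inv, P_inv):
    """C*(GG*)^-1 C + P^-1 for real matrices."""
    return C.T @ GG_inv @ C + P_inv

def is_identifiable(C, GG_inv, P_inv):
    """True when the information matrix has full rank (up to numerical tolerance)."""
    M = information_matrix(C, GG_inv, P_inv)
    return bool(np.linalg.matrix_rank(M) == M.shape[0])

C = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])   # second column = 2 x first column
GG_inv = np.eye(3)
no_prior = np.zeros((2, 2))

full_problem = is_identifiable(C, GG_inv, no_prior)
# Restated problem: the second parameter is deleted from the list of unknowns.
reduced_problem = is_identifiable(C[:, :1], GG_inv, no_prior[:1, :1])
print(full_problem, reduced_problem)
```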


5.4  NONLINEAR  SYSTEMS  WITH  ADDITIVE  GAUSSIAN  NOISE 


The general form of the system equations for a nonlinear system with additive Gaussian noise is

$$Z = f(\xi,U) + G(U)\omega \tag{5.4-1}$$

As in the case of linear systems, we will define by convention the mean of $\omega$ to be zero and the covariance
to be identity. If $\xi$ is random, we will assume that it is independent of $\omega$ and has the distribution given
by Equation (5.1-3).


5.4.1 Joint Distribution of Z and $\xi$


To define the estimators of Chapter 4, we need to know the distribution $p(Z|\xi,U)$. This distribution is
easily derived from Equation (5.4-1). The expressions $f(\xi,U)$ and G(U) are both constants if conditioned on
specific values of $\xi$ and U. Therefore we can apply the rules discussed in Chapter 3 for multiplication of
Gaussian vectors by constants and addition of constants to Gaussian vectors. Using these rules, we see that
the distribution of Z conditioned on $\xi$ and U is Gaussian with mean $f(\xi,U)$ and covariance G(U)G(U)*.

$$p(Z|\xi,U) = |2\pi G(U)G(U)^*|^{-1/2} \exp\left\{-\tfrac{1}{2}[Z - f(\xi,U)]^*[G(U)G(U)^*]^{-1}[Z - f(\xi,U)]\right\} \tag{5.4-2}$$

This is the obvious nonlinear generalization of Equation (5.1-6); the nonlinearity does not change the basic
method of derivation.


If $\xi$ is random, we will need to know the joint distribution $p(Z,\xi|U)$. The joint distribution is
computed by Bayes rule

$$p(Z,\xi|U) = p(Z|\xi,U)\,p(\xi|U) \tag{5.4-3}$$

Using Equations (5.1-3) and (5.4-2) gives

$$p(Z,\xi|U) = \left[|2\pi P|\,|2\pi GG^*|\right]^{-1/2} \exp\left\{-\tfrac{1}{2}[Z - f(\xi,U)]^*[G(U)G(U)^*]^{-1}[Z - f(\xi,U)] - \tfrac{1}{2}[\xi - m_\xi]^*P^{-1}[\xi - m_\xi]\right\} \tag{5.4-4}$$

Note that $p(Z,\xi|U)$ is not, in general, Gaussian. Although Z conditioned on $\xi$ is Gaussian, and $\xi$
is Gaussian, Z and $\xi$ need not be jointly Gaussian. This is one of the major differences between linear and
nonlinear systems with additive Gaussian noise.

Example 5.4-1 Let Z and $\xi$ be scalars, P = 1, $m_\xi = 0$, G(U) = 1, and
$f(\xi,U) = \xi^2$. Then

$$p(Z|\xi,U) = (2\pi)^{-1/2} \exp\left\{-\tfrac{1}{2}(Z - \xi^2)^2\right\}$$

and

$$p(\xi|U) = (2\pi)^{-1/2} \exp\left\{-\tfrac{1}{2}\xi^2\right\}$$

This gives

$$p(Z,\xi|U) = (2\pi)^{-1} \exp\left\{-\tfrac{1}{2}\left[\xi^2 + (Z - \xi^2)^2\right]\right\}$$

The general form of a joint Gaussian distribution for two variables Z and $\xi$ is

$$p(Z,\xi) = a \exp(b\xi^2 + cZ^2 + dZ\xi)$$

where a, b, c, and d are constants. The joint distribution of Z and $\xi$
cannot be manipulated into this form because a $\xi^4$ term appears in the
exponent. Thus Z and $\xi$ are not jointly Gaussian, even though Z
conditioned on $\xi$ is Gaussian and $\xi$ is Gaussian.
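The presence of the $\xi^4$ term in Example 5.4-1 can be confirmed numerically. For a jointly Gaussian density the logarithm is quadratic, so its fourth derivative with respect to $\xi$ vanishes; here it does not. The sketch below uses a central finite difference; the function names, the step size h, and the evaluation point are hypothetical choices for illustration.

```python
import math

def neg_log_joint(z, xi):
    """-ln p(Z, xi | U) for Example 5.4-1 (P = 1, m_xi = 0, G = 1, f = xi**2)."""
    return 0.5 * (xi**2 + (z - xi**2) ** 2) + math.log(2.0 * math.pi)

def fourth_difference(g, x, h=0.1):
    """Central finite-difference approximation of the fourth derivative of g at x."""
    return (g(x - 2*h) - 4*g(x - h) + 6*g(x) - 4*g(x + h) + g(x + 2*h)) / h**4

# Fourth derivative in xi at an arbitrary point; it would be 0 for a quadratic exponent.
d4 = fourth_difference(lambda xi: neg_log_joint(1.0, xi), 0.3)
print(d4)
```

The exponent contains $\tfrac{1}{2}\xi^4$, whose fourth derivative is 12, so the printed value is 12 up to rounding.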

Given Equation (5.4-4), we can compute the marginal distribution of Z, and the conditional distribution
of $\xi$ given Z from the equations

$$p(Z) = \int p(Z,\xi)\,d\xi \tag{5.4-5}$$

and

$$p(\xi|Z) = \frac{p(Z,\xi)}{p(Z)} \tag{5.4-6}$$

The integral in Equation (5.4-5) is not easy to evaluate in general. Since $p(Z,\xi)$ is not necessarily
Gaussian, or any other standard distribution, the only general means of computing p(Z) is to numerically
integrate Equation (5.4-5) for a grid of Z values. If $\xi$ and Z are vectors, this can be a quite formidable
task. Therefore, we will avoid the use of p(Z) and $p(\xi|Z)$ for nonlinear systems.

5.4.2 Estimators


The a posteriori expected value and Bayes optimal estimators are seldom used for nonlinear systems because
their computation is difficult. Computation of the expected value requires the numerical integration of
Equation (5.4-5) and the evaluation of Equation (5.4-6) to find the conditional distribution, and then the
integration of $\xi$ times the conditional distribution. Theorem (4.3-1) says that the Bayes optimal estimator
for quadratic loss is equal to the a posteriori expected value estimator. The computation of the Bayes optimal
estimates requires the same or equivalent multidimensional integrations, so Theorem (4.3-1) does not provide us
with a simplified means of computing the estimates.
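For the scalar problem of Example 5.4-1 the integrations described above are still manageable, and a sketch makes the steps concrete. The code below is an illustration under stated assumptions: it uses a simple trapezoidal rule on a fixed grid, and the grid limits, grid size, function names, and the measured value Z = 2 are hypothetical. It also locates the largest grid value of the joint density for comparison with the expected value.

```python
import math

def joint(z, xi):
    """p(Z, xi | U) for Example 5.4-1."""
    return math.exp(-0.5 * (xi**2 + (z - xi**2) ** 2)) / (2.0 * math.pi)

def trapezoid(values, step):
    """Trapezoidal rule for equally spaced samples."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))

def posterior_summary(z, lo=-6.0, hi=6.0, n=2401):
    step = (hi - lo) / (n - 1)
    grid = [lo + k * step for k in range(n)]
    p_joint = [joint(z, xi) for xi in grid]
    p_z = trapezoid(p_joint, step)                        # Equation (5.4-5)
    p_post = [p / p_z for p in p_joint]                   # Equation (5.4-6)
    mean = trapezoid([xi * p for xi, p in zip(grid, p_post)], step)
    k_max = max(range(n), key=lambda k: p_joint[k])       # grid point of largest density
    return mean, grid[k_max], p_z

mean, xi_peak, p_z = posterior_summary(2.0)
print(mean, xi_peak, p_z)
```

In this particular example the posterior density is symmetric in $\xi$ but has two peaks, near $\xi = \pm\sqrt{1.5}$, so the expected value (zero) lies between the peaks rather than on either of them. For vector $\xi$ the grid grows exponentially with dimension, which is the practical difficulty noted in the text.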

Since the posterior distribution of $\xi$ need not be symmetric, the MAP estimate is not, in general, equal to the
a posteriori expected value for nonlinear systems. The MAP estimator does not require the use of
Equations (5.4-5) and (5.4-6). The MAP estimate is obtained by maximizing Equation (5.4-6) with respect to $\xi$.
Since p(Z) is not a function of $\xi$, we can equivalently maximize Equation (5.4-4). For general, nonlinear
systems, we must do this maximization using numerical optimization techniques.

It is usually convenient to work with the logarithm of Equation (5.4-4). Since standard optimization
conventions are phrased in terms of minimization, rather than maximization, we will state the problem as
minimizing the negative of the logarithm of the probability density.

$$-\ln p(Z,\xi|U) = \tfrac{1}{2}[Z - f(\xi,U)]^*(GG^*)^{-1}[Z - f(\xi,U)] + \tfrac{1}{2}[\xi - m_\xi]^*P^{-1}[\xi - m_\xi] + \tfrac{1}{2}\ln\left[|2\pi P|\,|2\pi GG^*|\right] \tag{5.4-7}$$

Since the last term of Equation (5.4-7) is a constant, it does not affect the optimization. We can
therefore define the cost functional to be minimized as

$$J(\xi) = \tfrac{1}{2}[Z - f(\xi,U)]^*(GG^*)^{-1}[Z - f(\xi,U)] + \tfrac{1}{2}[\xi - m_\xi]^*P^{-1}[\xi - m_\xi] \tag{5.4-8}$$

We have omitted the dependence of J on Z and U from the notation because it will be evaluated for specific
Z and U in application; $\xi$ is the only variable with respect to which we are optimizing. Equation (5.4-8)
makes it clear that the MAP estimator is also a least-squares estimator for this problem. The $(GG^*)^{-1}$ and
$P^{-1}$ matrices are weightings on the squared measurement error and the squared error in the prior estimate of
$\xi$, respectively.

For the maximum likelihood estimate we maximize Equation (5.4-2) instead of Equation (5.4-4). As in the
case of linear systems, the maximum likelihood estimate is equal to the limit of the MAP estimate as $P^{-1}$ goes
to zero; i.e., the last term of Equation (5.4-8) is omitted.
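The cost functional of Equation (5.4-8) is simple to evaluate directly. The following sketch is illustrative: the function name `map_cost` and its argument names are hypothetical, real-valued vectors are assumed, and omitting the prior arguments gives the maximum likelihood cost. It is applied to the scalar problem of Example 5.4-1 with an assumed measurement Z = 2.

```python
import numpy as np

def map_cost(xi, Z, f, GG_inv, m_xi=None, P_inv=None):
    """Evaluate J(xi) of Equation (5.4-8); without a prior this is the ML cost."""
    r = Z - f(xi)                                  # measurement residual
    J = 0.5 * r @ GG_inv @ r
    if P_inv is not None:
        d = xi - m_xi                              # deviation from the prior mean
        J += 0.5 * d @ P_inv @ d
    return float(J)

# Example 5.4-1 written with 1-element vectors: f(xi) = xi**2, P = 1, m_xi = 0, GG* = 1.
f = lambda xi: xi**2
Z = np.array([2.0])
one = np.eye(1)
m_xi = np.zeros(1)

J_map_at_peak = map_cost(np.array([np.sqrt(1.5)]), Z, f, one, m_xi, one)
J_map_at_zero = map_cost(np.zeros(1), Z, f, one, m_xi, one)
J_ml_at_root = map_cost(np.array([np.sqrt(2.0)]), Z, f, one)
print(J_map_at_peak, J_map_at_zero, J_ml_at_root)
```

Setting the derivative of J to zero for this example gives $\xi(1 - 2Z + 2\xi^2) = 0$, so for Z = 2 the MAP cost is smallest at $\xi^2 = 1.5$, while the ML cost reaches zero at $\xi^2 = 2$.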

For a single measurement, or even for a finite number of measurements, the nonlinear MAP and MLE
estimators have none of the optimality properties discussed in Chapter 4. The estimates are neither unbiased,
minimum variance, Bayes optimal, nor efficient. When there are a large number of measurements, the differences
from optimality are usually small enough to ignore for practical purposes. The main benefits of the nonlinear
MLE and MAP estimators are their relative ease of computation and their links to the intuitively attractive
idea of least squares. These links give some reason to suspect that even if some of the assumptions about the
noise distribution are questionable, the estimators still make sense from a nonstatistical viewpoint. The
final practical judgment of an estimator is based on whether the estimates are adequate for their intended
use, rather than on whether they are exactly optimum.

The extension of Equation (5.4-8) to multiple independent experiments is straightforward.

$$J(\xi) = \tfrac{1}{2}\sum_{i=1}^{N} [Z_i - f(\xi,U_i)]^*(GG^*)^{-1}[Z_i - f(\xi,U_i)] + \tfrac{1}{2}[\xi - m_\xi]^*P^{-1}[\xi - m_\xi] \tag{5.4-9}$$

where N is the number of experiments performed. The maximum likelihood estimator is obtained by omitting
the last term. The asymptotic properties are defined as N goes to infinity. The maximum likelihood
estimator can be shown to be asymptotically unbiased and asymptotically efficient (and thus also asymptotically
minimum-variance unbiased) under quite general conditions. The estimator is also consistent. The rigorous
proofs of these properties (Cramer, 1946), although not extremely difficult, are fairly lengthy and will not
be presented here. The only condition required is that

$$\frac{1}{N}\sum_{i=1}^{N} [\nabla_\xi f(\xi,U_i)]^*(GG^*)^{-1}[\nabla_\xi f(\xi,U_i)]$$

converge to a positive definite matrix. Cramer (1946) also proves that the estimates asymptotically approach
a Gaussian distribution.

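Since the maximum likelihood estimates are asymptotically efficient, the Cramer-Rao inequality
(Equation (4.2-20)) gives a good estimate of the covariance of the estimate for large N. Using Equation (4.2-19)
for the information matrix gives

$$\begin{aligned}
M(\xi) &= \sum_{i=1}^{N} E\left\{[\nabla_\xi f(\xi,U_i)]^*(GG^*)^{-1}[Z_i - f(\xi,U_i)][Z_i - f(\xi,U_i)]^*(GG^*)^{-1}[\nabla_\xi f(\xi,U_i)]\right\} \\
&= \sum_{i=1}^{N} [\nabla_\xi f(\xi,U_i)]^*(GG^*)^{-1}E\left\{[Z_i - f(\xi,U_i)][Z_i - f(\xi,U_i)]^*\right\}(GG^*)^{-1}[\nabla_\xi f(\xi,U_i)] \\
&= \sum_{i=1}^{N} [\nabla_\xi f(\xi,U_i)]^*(GG^*)^{-1}GG^*(GG^*)^{-1}[\nabla_\xi f(\xi,U_i)] \\
&= \sum_{i=1}^{N} [\nabla_\xi f(\xi,U_i)]^*(GG^*)^{-1}[\nabla_\xi f(\xi,U_i)]
\end{aligned} \tag{5.4-10}$$

The covariance of the maximum likelihood estimate is thus approximated by

$$\operatorname{cov}(\hat\xi|\xi) \approx \left\{\sum_{i=1}^{N} [\nabla_\xi f(\xi,U_i)]^*(GG^*)^{-1}[\nabla_\xi f(\xi,U_i)]\right\}^{-1} \tag{5.4-11}$$

When $\xi$ has a prior distribution, the corresponding approximation for the covariance of the posterior
distribution of $\xi$ is

$$\operatorname{cov}(\xi|Z) \approx \left\{\sum_{i=1}^{N} [\nabla_\xi f(\xi,U_i)]^*(GG^*)^{-1}[\nabla_\xi f(\xi,U_i)] + P^{-1}\right\}^{-1} \tag{5.4-12}$$

5.4.3 Computation of the Estimates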


The discussion of the previous section did not address the question of how to compute the MAP and ML
estimates. Equation (5.4-9) (without the last term for the MLE) is the cost functional to minimize.
Minimization of such nonlinear functions can be a difficult proposition, as discussed in Chapter 2.

Equation  (5.4-9)  is  in  the  form  of  a  sum  of  squares.  Therefore  the  Gauss-Newton  method  is  often  the  best 
choice  of  optimization  method.  Chapter  2  discussed  the  details  of  the  Gauss-Newton  method.  The  probabilistic 
background  of  Equation  (5.4-9)  allows  us  to  apply  the  central  limit  theorem  to  strengthen  one  of  the  arguments 
used  to  support  the  Gauss-Newton  method. 
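A minimal sketch of a Gauss-Newton iteration for the cost functional of Equation (5.4-9) is given below. It is not the algorithm as detailed in Chapter 2: it takes undamped steps, assumes real-valued quantities, and uses a hypothetical one-parameter exponential-decay model with noise-free data so that the result can be checked. The function name `gauss_newton` and its arguments are illustrative. The inverse of the retained second-gradient term is returned as an approximate covariance of the estimate.

```python
import numpy as np

def gauss_newton(f, grad_f, Z, U, GG_inv, xi0, m_xi=None, P_inv=None, iters=50, tol=1e-12):
    """Minimize the sum-of-squares cost; omit m_xi and P_inv for the ML estimate."""
    xi = np.array(xi0, dtype=float)
    n = xi.size
    if P_inv is None:                      # no prior term: maximum likelihood
        P_inv, m_xi = np.zeros((n, n)), np.zeros(n)
    for _ in range(iters):
        grad = P_inv @ (xi - m_xi)         # gradient contribution of the prior term
        hess = np.array(P_inv, dtype=float)
        for z_i, u_i in zip(Z, U):
            A = grad_f(xi, u_i)            # sensitivity matrix (measurements x parameters)
            r = z_i - f(xi, u_i)           # residual for experiment i
            grad -= A.T @ GG_inv @ r
            hess += A.T @ GG_inv @ A       # term retained by the Gauss-Newton approximation
        step = np.linalg.solve(hess, grad)
        xi = xi - step
        if np.linalg.norm(step) < tol:
            break
    # hess was evaluated just before the final (negligible) step.
    return xi, np.linalg.inv(hess)

# Hypothetical model f(xi, u) = exp(-xi * u) with noise-free data generated at xi = 0.7.
f = lambda xi, u: np.array([np.exp(-xi[0] * u)])
grad_f = lambda xi, u: np.array([[-u * np.exp(-xi[0] * u)]])
U = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
Z = [f(np.array([0.7]), u) for u in U]
GG_inv = np.array([[1.0 / 0.01]])          # assumed noise variance of 0.01

xi_hat, cov_hat = gauss_newton(f, grad_f, Z, U, GG_inv, xi0=[0.3])
print(xi_hat, cov_hat)
```

Practical implementations usually add safeguards such as step-size control, since an undamped step can fail when the starting value is far from the minimum.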

For simplicity, assume that all of the $U_i$ are identical. Compare the limiting behavior of the two terms
of the second gradient, as expressed by Equation (2.5-10). The term retained by the Gauss-Newton approximation
is $N[\nabla_\xi f]^*(GG^*)^{-1}[\nabla_\xi f]$, which grows linearly with N. At the true value of $\xi$, $Z_i - f(\xi,U_i)$ is a Gaussian
random variable with mean 0 and covariance GG*. Therefore, the omitted term of the second gradient is a sum
of independent, identically distributed, random variables with zero mean. By the central limit theorem, the
variance of 1/N times this term goes to zero as N goes to infinity. Since 1/N times the retained term
goes to a nonzero constant, the omitted term is small compared to the retained one for large N. This
conclusion is still true if the $U_i$ are not identical, as long as f and its gradients are bounded and the first
gradient does not converge to zero.

This demonstrates that for large N the omitted term is small compared to the retained term if $\xi$ is at
the true value, and, by continuity, if $\xi$ is sufficiently close to the true value. When $\xi$ is far from the
true value, the arguments of Chapter 2 apply.

5.4.4 Singularities

The singular cases which arise for nonlinear systems are basically the same as for linear systems and have
similar solutions. Limits as $P^{-1}$ or $(GG^*)^{-1}$ approach singular values pose no difficulty. Singular P or GG*
matrices are handled by reducing the problem to a nonsingular subproblem as in the linear case.

The one singularity which merits some additional discussion in the nonlinear case corresponds to singular

$$\sum_{i=1}^{N} C_i^*(GG^*)^{-1}C_i + P^{-1}$$

in the linear case. The equivalent matrix in the nonlinear case, if we use the Gauss-Newton algorithm, is
given by

$$\nabla_\xi^2 J(\xi) \approx \sum_{i=1}^{N} [\nabla_\xi f(\xi,U_i)]^*(GG^*)^{-1}[\nabla_\xi f(\xi,U_i)] + P^{-1} \tag{5.4-13}$$

If Equation (5.4-13) is singular at the true value, the system is said to be unidentifiable. We discussed the
computational problems of this singularity in Chapter 2. Even if the optimization algorithm correctly finds a
unique minimum, Equation (5.4-11) indicates that the covariance of a maximum likelihood estimate would be very
large. (The covariance is approximated by the inverse of a nearly singular matrix.) Thus the experimental
data contain very little information about the value of some parameter or combination of parameters. Note that
the covariance estimate is unrelated to the optimization algorithm; changes to the optimization algorithm
might help you find the minimum, but will not change the properties of the resulting estimates. The
singularity can be eliminated by using a prior distribution with a positive definite $P^{-1}$, but in this case, the
estimated parameter values will be strongly influenced by the prior distribution, since the experimental data is
lacking in information.

As with linear systems, unidentifiability is a serious problem. To obtain usable estimates, it is
generally necessary to either reformulate the problem or redesign the experiment. With nonlinear systems, we have
the additional difficulty of diagnosing whether identifiability problems are present or not. This difficulty
arises because Equation (5.4-13) is a function of $\xi$ and it is necessary to evaluate it at or near the minimum
to ascertain whether the system is identifiable. If the system is not identifiable, it may be difficult for
the algorithm to approach the (possibly nonunique) minimum because of convergence problems.
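The diagnosis described above amounts to evaluating the matrix of Equation (5.4-13) at a chosen $\xi$ and examining its rank or conditioning. The sketch below is illustrative only; the helper name `gauss_newton_information` and the model $f(\xi,U) = \xi_1\xi_2 U$, in which only the product of the two parameters affects the response, are hypothetical.

```python
import numpy as np

def gauss_newton_information(grad_f, xi, U, GG_inv, P_inv):
    """Evaluate the matrix of Equation (5.4-13) at xi (real-valued quantities assumed)."""
    M = np.array(P_inv, dtype=float)
    for u in U:
        A = grad_f(xi, u)                 # sensitivity matrix for experiment u
        M += A.T @ GG_inv @ A
    return M

# Hypothetical model f(xi, u) = xi_1 * xi_2 * u, a scalar measurement.
grad_f = lambda xi, u: np.array([[xi[1] * u, xi[0] * u]])
xi = np.array([2.0, 3.0])
U = [1.0, 2.0, 3.0]
GG_inv = np.eye(1)

M_no_prior = gauss_newton_information(grad_f, xi, U, GG_inv, np.zeros((2, 2)))
M_with_prior = gauss_newton_information(grad_f, xi, U, GG_inv, np.diag([0.0, 1.0]))
print(np.linalg.matrix_rank(M_no_prior), np.linalg.matrix_rank(M_with_prior))
```

Without prior information the matrix has rank 1 at every $\xi$, so the two parameters cannot be separated by these experiments; prior information on one of them restores full rank, with the caveat given in the text that the estimate is then strongly influenced by the prior.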

5.4.5  Partitioning 

In both theory and computation, parameter estimation is much more difficult for nonlinear than for linear
systems. Therefore, means of simplifying parameter estimation problems are particularly desirable for
nonlinear systems. The partitioning ideas of Section 5.2 have this potential for some problems.

The parameter partitioning ideas of Section 5.2.3 make no linearity assumptions, and thus apply directly
to nonlinear problems. We have little more to add to the earlier discussion of parameter partitioning except
to say that parameter partitioning is often extremely important in nonlinear systems. It can make the critical
difference between a tractable and an intractable problem formulation.

Measurement partitioning, as formulated in Section 5.2.1, is impractical for most nonlinear systems. For
general nonlinear systems, the posterior density function $p(\xi|Z_1)$ will not be Gaussian or any other simple
form. The practical application of measurement partitioning to linear systems arises directly from the fact
that Gaussian distributions are uniquely defined by their mean and covariance.

The only practical method of applying measurement partitioning to nonlinear systems is to approximate the
function $p(\xi|Z_1)$ (or $p(Z_1|\xi)$ for MLE estimates) by some simple form described by a few parameters. The
obvious approximation in most cases is a Gaussian density function with the same mean and covariance. The
exact covariance is difficult to compute, but Equations (5.4-11) and (5.4-12) give good approximations for this
purpose.


5.5  MULTIPLICATIVE  GAUSSIAN  NOISE  (ESTIMATION  OF  VARIANCE) 

The previous sections of this chapter have assumed that the G matrix is known. The results are quite
different when G is unknown because the noise multiplies G rather than adding to it.

For convenience, we will work directly with GG* to avoid the necessity of taking matrix square roots.

We compute the estimates of G by taking the positive semidefinite, symmetric-matrix square roots of the
estimates of GG*.

The general form of a nonlinear system with unknown G is

$$Z = f(\xi,U) + G(\xi,U)\omega \tag{5.5-1}$$

We will consider N independent measurements $Z_i$ resulting from the experiments $U_i$. The $Z_i$ are then
independent Gaussian vectors with means $f(\xi,U_i)$ and covariances $G(\xi,U_i)G(\xi,U_i)^*$. We will use
Equation (5.1-3) for the prior distributions of $\xi$. Bayes rule (Equation (5.4-3)) then gives us the joint
distribution of $\xi$ and the $Z_i$ given the $U_i$. Equations (5.4-5) and (5.4-6) define the marginal distribution of
Z and the posterior distribution of $\xi$ given Z. The latter distributions are cumbersome to evaluate and
thus seldom used.

Because of the difficulty of computing the posterior distribution, the a posteriori expected value and
Bayes optimal estimators are seldom used. We can compute the maximum likelihood estimates by minimizing the
negative of the logarithm of the likelihood functional. Ignoring irrelevant constant terms, the resulting
cost functional is

$$J(\xi) = \tfrac{1}{2}\sum_{i=1}^{N}\left\{[Z_i - f(\xi)]^*[G(\xi)G(\xi)^*]^{-1}[Z_i - f(\xi)] + \ln|G(\xi)G(\xi)^*|\right\} \tag{5.5-2}$$

or equivalently

$$J(\xi) = \tfrac{1}{2}\operatorname{trace}\left\{[G(\xi)G(\xi)^*]^{-1}\sum_{i=1}^{N}[Z_i - f(\xi)][Z_i - f(\xi)]^*\right\} + \tfrac{1}{2}N\ln|G(\xi)G(\xi)^*| \tag{5.5-3}$$

We have omitted the explicit dependence on $U_i$ from the notation and assume that all of the $U_i$ are
identical. (The generalization to different $U_i$ is easy and changes little of essence.) The MAP estimator
minimizes a cost functional equal to Equation (5.5-2) plus the extra term $\tfrac{1}{2}[\xi - m_\xi]^*P^{-1}[\xi - m_\xi]$. The MAP
estimate of GG* is seldom used because the ML estimate is easier to compute and proves quite satisfactory.

We can use numerical methods to minimize Equation (5.5-2) and compute the ML estimates. In most
practical problems, the following parameter partitioning greatly simplifies the computation required: assume that
the $\xi$ vector can be partitioned into independent vectors $\xi_G$ and $\xi_f$ such that

$$GG^* = GG^*(\xi_G), \qquad f = f(\xi_f) \tag{5.5-4}$$

The partition $\xi_f$ may be empty, in which case f is a constant (if $\xi_G$ is empty we have a known GG*
matrix, and the problem reduces to that discussed in the previous section). Assume further that the GG*
matrix is completely unknown, except for the restriction that it be positive semidefinite.

Set the gradients of Equation (5.5-2) with respect to GG* and $\xi_f$ equal to zero in order to find the
unconstrained minimum. Using the matrix differentiation results (A.2-5) and (A.2-6) from Appendix A, we get

$$0 = \nabla_{GG^*}J(\xi_f, GG^*) = -\tfrac{1}{2}(GG^*)^{-1}\sum_{i=1}^{N}[Z_i - f(\xi_f)][Z_i - f(\xi_f)]^*(GG^*)^{-1} + \tfrac{1}{2}N(GG^*)^{-1} \tag{5.5-5}$$

$$0 = \nabla_{\xi_f}J(\xi_f, GG^*) = -\sum_{i=1}^{N}[Z_i - f(\xi_f)]^*(GG^*)^{-1}[\nabla_{\xi_f}f(\xi_f)] \tag{5.5-6}$$

Equation (5.5-5) gives

$$\widehat{GG^*} = \frac{1}{N}\sum_{i=1}^{N}[Z_i - f(\xi_f)][Z_i - f(\xi_f)]^* \tag{5.5-7}$$

which is the familiar sample second moment of the residuals. The estimate of GG* from Equation (5.5-7) is
always positive semidefinite. It is possible for this estimate to be singular, in which case we must use the
techniques previously discussed for handling singular GG* matrices. For a given $\xi_f$, Equation (5.5-7) is a
simple noniterative estimator for GG*. This closed-form expression is the reason for the partition of $\xi$
into $\xi_f$ and $\xi_G$.

We can constrain $GG^*$ to be diagonal, in which case the solution is the diagonal elements of
Equation (5.5-7). If we place other types of constraints on $GG^*$, such as knowledge of the values of individual
off-diagonal elements, such simple closed-form solutions are not apparent. In practice, such constraints are
seldom required.

If $\xi_f$ is empty, Equation (5.5-7) is the solution to the problem. If $\xi_f$ is not empty, we need to
combine this subproblem solution with a solution for $\xi_f$ to get a solution of the entire problem. Let us
investigate the two methods discussed in Section 5.2.3.

The first method is axial iteration. Axial iteration involves successively estimating $\xi_G$ with fixed
$\xi_f$, and estimating $\xi_f$ with fixed $\xi_G$. Equation (5.5-5) gives the $\xi_G$ estimate in closed form for
fixed $\xi_f$. To estimate $\xi_f$ with fixed $\xi_G$, we must minimize Equation (5.5-2) with respect to $\xi_f$.
Unless the system is linear, this minimization requires an iterative method. For fixed $G$, Equation (5.5-2) is
in the form of a sum of squares and the Gauss-Newton method is an appropriate choice (in fact this subproblem
is identical to the problem discussed in Section 5.4). We thus have an inner iteration within the outer axial
iteration of $\xi_f$ and $\xi_G$. In such situations, efficiency is often improved by terminating the inner
iteration before it converges, inasmuch as the largest changes in the $\xi_f$ estimates occur on the early inner
iterations. After these early iterations, more can be gained by revising $GG^*$ to reflect these large changes
than by refining $\xi_f$. Since the estimates of $\xi_f$ and $GG^*$ affect one another, there is no point in
obtaining extremely accurate estimates of $\xi_f$ until $GG^*$ is known to a corresponding accuracy. As Gauss
(1809, p. 249) said concerning a different problem:


It then can only be worth while to aim at the highest accuracy, when the
final correction is to be given to the orbit to be determined. But as long
as it appears probable that new observations will give rise to new corrections,
it will be convenient to relax, more or less, as the case may be from
extreme precision, if in this way, the length of the computations can be
considerably diminished.

Exploiting this concept to its fullest suggests using only one iteration of the Gauss-Newton algorithm for the
inner "iteration." In this case the inner iteration is no longer iterative, and the overall algorithm would
be as follows:

1. Estimate $GG^*$ using Equation (5.5-7) and the current guess of $\xi_f$.

2. Use one iteration of the Gauss-Newton algorithm to revise the estimate of $\xi_f$.

3. Repeat steps 1 and 2 until convergence.
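The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in any particular program: the two-channel response `f`, its derivative `grad_f`, the data `z`, and the starting guess are all hypothetical, the parameter $\xi_f$ is a scalar, and $GG^*$ is constrained to be diagonal so that step 1 reduces to the diagonal elements of Equation (5.5-7).

```python
def f(xi):
    """Hypothetical nonlinear response with two output channels."""
    return [xi, xi * xi]

def grad_f(xi):
    """Derivative of each output channel with respect to the scalar xi."""
    return [1.0, 2.0 * xi]

def estimate_variances(z, xi):
    """Step 1: diagonal elements of Equation (5.5-7) for the current xi."""
    n = len(z)
    fx = f(xi)
    return [sum((zi[j] - fx[j]) ** 2 for zi in z) / n for j in range(2)]

def gauss_newton_step(z, xi, var):
    """Step 2: one Gauss-Newton update of xi with the variances held fixed."""
    n = len(z)
    fx, g = f(xi), grad_f(xi)
    num = sum(g[j] * (zi[j] - fx[j]) / var[j] for zi in z for j in range(2))
    den = n * sum(g[j] ** 2 / var[j] for j in range(2))
    return xi + num / den

def axial_iteration(z, xi, tol=1e-12, max_iter=100):
    """Step 3: alternate steps 1 and 2 until xi stops changing."""
    for _ in range(max_iter):
        var = estimate_variances(z, xi)
        xi_new = gauss_newton_step(z, xi, var)
        if abs(xi_new - xi) < tol:
            return xi_new, estimate_variances(z, xi_new)
        xi = xi_new
    return xi, estimate_variances(z, xi)

# Hypothetical measurements of the two channels
z = [[1.9, 4.3], [2.1, 3.8], [2.0, 4.1], [2.2, 3.9], [1.8, 4.0]]
xi_hat, var_hat = axial_iteration(z, xi=1.0)
```

For this made-up data set the estimate settles between 2.0 (the value favored by the first channel alone) and the square root of 4.02 (the value favored by the second), with the relative weighting set by the estimated variances.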

In general, axial iteration is a very poor algorithm, as discussed in Chapter 2. The convergence is often
extremely slow. Furthermore, the algorithm can converge to a point that is not a strict local minimum and yet
give no hint of a problem. For this particular application, however, the performance of axial iteration
borders on spectacular.

Let us consider, for a while, the alternative to axial iteration: substituting Equation (5.5-7) into
Equation (5.5-3). This substitution gives

$$J(\xi_f) = \frac{1}{2} N \,\mathrm{trace}(I) + \frac{1}{2} N \ln \left| \frac{1}{N} \sum_{i=1}^{N} [Z_i - f(\xi_f)][Z_i - f(\xi_f)]^* \right| \tag{5.5-8}$$

The first term is irrelevant to the minimization, so we will redefine the cost function as

$$J(\xi_f) = \frac{1}{2} N \ln \left| \frac{1}{N} \sum_{i=1}^{N} [Z_i - f(\xi_f)][Z_i - f(\xi_f)]^* \right| \tag{5.5-9}$$

You may sometimes see this cost function written in the equivalent (for our purposes) form

$$J(\xi_f) = |\hat{G}\hat{G}^*| \tag{5.5-10}$$

Examine the gradient of Equation (5.5-9). Using the matrix differentiation results (A.2-3) and (A.2-6)
from Appendix A, we obtain

$$\nabla_{\xi_f} J(\xi_f) = -\sum_{i=1}^{N} [Z_i - f(\xi_f)]^* \left\{ \frac{1}{N} \sum_{j=1}^{N} [Z_j - f(\xi_f)][Z_j - f(\xi_f)]^* \right\}^{-1} [\nabla_{\xi_f} f(\xi_f)] \tag{5.5-11}$$

This is more compactly expressed as

$$\nabla_{\xi_f} J(\xi_f) = -\sum_{i=1}^{N} [Z_i - f(\xi_f)]^* (\hat{G}\hat{G}^*)^{-1} [\nabla_{\xi_f} f(\xi_f)] \tag{5.5-12}$$

which is exactly the same as Equation (5.5-6) evaluated at $G = \hat{G}$. Furthermore, the Gauss-Newton method
used to solve Equation (5.5-6) is a good method for solving Equation (5.5-12) because

$$\nabla^2_{\xi_f} J(\xi_f) \approx \sum_{i=1}^{N} [\nabla_{\xi_f} f(\xi_f)]^* (\hat{G}\hat{G}^*)^{-1} [\nabla_{\xi_f} f(\xi_f)] \tag{5.5-13}$$

Equation (5.5-13) neglects the derivative of $\hat{G}\hat{G}^*$ with respect to $\xi_f$, but we can easily show
that the term so neglected is even smaller than the term containing $\nabla^2_{\xi_f} f(\xi_f)$, the omission of
which we previously justified.

Therefore, axial iteration is identical to substitution of Equation (5.5-7) as a constraint. It seems
likely that we could use this equality to make deductions about the geometry of the cost function and thence
about the behavior of various algorithms. (Perhaps there may be some kind of orthogonality property buried
here.) Several computer programs, including the Iliff-Maine MMLE3 code (Maine and Iliff, 1980; and Maine,
1981), use axial iteration, or a modification thereof, often with little more justification than that it seems
to work well. This is, of course, the final and most important justification, but it is best used as
verification of analytical arguments. Although Equations (5.5-12) and (5.5-13) are derived in standard texts, we
have not seen the relationship between these equations and axial iteration pursued in the literature. It is
plain that this equivalence relates to the excellent performance of axial iteration on this problem. We will
leave further inquiry along this line to the reader.
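The agreement between the gradient of the substituted cost function (5.5-9) and the fixed-$G$ gradient (5.5-12) can be checked numerically. The sketch below is only an illustration for a scalar measurement and a scalar parameter; the response `f`, the data `z`, and the evaluation point are hypothetical. It compares a central finite difference of Equation (5.5-9) with Equation (5.5-12) evaluated using the estimate from Equation (5.5-7).

```python
import math

def f(xi):
    """Hypothetical scalar response."""
    return math.exp(xi)

def df(xi):
    """Derivative of the hypothetical response."""
    return math.exp(xi)

def substituted_cost(z, xi):
    """Equation (5.5-9) for a scalar measurement: (N/2) ln(mean squared residual)."""
    n = len(z)
    return 0.5 * n * math.log(sum((zi - f(xi)) ** 2 for zi in z) / n)

def gradient_fixed_g(z, xi):
    """Equation (5.5-12): gradient with GG* frozen at the estimate from (5.5-7)."""
    n = len(z)
    gg_hat = sum((zi - f(xi)) ** 2 for zi in z) / n
    return -sum((zi - f(xi)) * df(xi) for zi in z) / gg_hat

z = [2.9, 3.2, 2.7, 3.1, 3.0]   # hypothetical measurements
xi, h = 0.8, 1e-6
numeric = (substituted_cost(z, xi + h) - substituted_cost(z, xi - h)) / (2.0 * h)
analytic = gradient_fixed_g(z, xi)
```

The two values agree to within the finite-difference error, even though `gradient_fixed_g` never differentiates the variance estimate with respect to the parameter.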


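An important special case of Equation (5.5-1) occurs when $f(\xi_f)$ is linear

$$f(\xi_f) = C\xi_f \tag{5.5-14}$$

with invertible $C$. For linear $f$, Equation (5.5-6) is solved exactly in a single Gauss-Newton iteration,
and the solution is

$$\hat{\xi}_f = (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1} \frac{1}{N} \sum_{i=1}^{N} Z_i \tag{5.5-15}$$

If $C$ is invertible, this reduces to

$$\hat{\xi}_f = C^{-1} \frac{1}{N} \sum_{i=1}^{N} Z_i \tag{5.5-16}$$

independent of $GG^*$. This is, of course, $C^{-1}$ times the sample mean. Substituting Equations (5.5-14)
and (5.5-16) into (5.5-7) gives

$$\hat{G}\hat{G}^* = \frac{1}{N} \sum_{i=1}^{N} \left[ Z_i - \frac{1}{N} \sum_{j=1}^{N} Z_j \right] \left[ Z_i - \frac{1}{N} \sum_{j=1}^{N} Z_j \right]^* \tag{5.5-17}$$

which is the familiar sample variance. Equation (5.5-17) can be manipulated into the alternate form

$$\hat{G}\hat{G}^* = \frac{1}{N} \left[ \sum_{i=1}^{N} Z_i Z_i^* - \frac{1}{N} \left( \sum_{i=1}^{N} Z_i \right) \left( \sum_{j=1}^{N} Z_j \right)^* \right] \tag{5.5-18}$$

Because $\hat{\xi}_f$ is not a function of $GG^*$, the computation of $\hat{\xi}_f$ and $\hat{G}\hat{G}^*$ does
not require iteration for this system model.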


In general, the maximum likelihood estimates are asymptotically unbiased and efficient, but they need
have no such properties for finite $N$. For linear invertible systems, the biases are easy to compute. From
Equation (5.5-16),

$$E\{\hat{\xi}_f | \xi_f\} = C^{-1} \frac{1}{N} \sum_{i=1}^{N} C\xi_f = \xi_f \tag{5.5-19}$$

This equation shows that $\hat{\xi}_f$ is unbiased for finite $N$ for linear invertible systems. From
Equation (5.5-18), using the fact that $\sum Z_i$ is Gaussian with mean $NC\xi_f$ and covariance $N\,GG^*$,

$$E\{\hat{G}\hat{G}^* | \xi\} = \frac{1}{N} \left[ N(C\xi_f\xi_f^*C^* + GG^*) - \frac{1}{N}(N^2 C\xi_f\xi_f^*C^* + N\,GG^*) \right] = \frac{N-1}{N}\, GG^* \tag{5.5-20}$$

Thus $\hat{G}\hat{G}^*$ is biased for finite $N$. Examining Equation (5.5-20), we see that the estimator defined
by multiplying the ML estimate by $N/(N-1)$ is unbiased for finite $N$ if $N > 1$. This unbiased estimate is
often used instead of the maximum likelihood estimate. For large $N$, the difference is inconsequential.
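The factor $(N-1)/N$ in Equation (5.5-20) can be confirmed exactly on a small example. The sketch below makes an assumption that departs from the Gaussian setting of the text: it uses a hypothetical scalar measurement that takes the values -1 and +1 with equal probability (true variance 1), so that the expectation can be computed by enumerating every possible sample rather than by simulation. The bias factor depends only on the second moments, so the same value results.

```python
from itertools import product

def ml_variance(sample):
    """Equation (5.5-17) for scalar data: mean squared deviation from the sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((z - mean) ** 2 for z in sample) / n

def expected_ml_variance(n):
    """Exact expectation over all equally likely samples of n values from {-1, +1}."""
    samples = list(product([-1.0, 1.0], repeat=n))
    return sum(ml_variance(s) for s in samples) / len(samples)

n = 4
e_ml = expected_ml_variance(n)         # expected (N - 1)/N times the true variance
e_corrected = e_ml * n / (n - 1)       # after multiplying by N/(N - 1)
```

With $N = 4$ the expectation of the ML estimate is three quarters of the true variance, and the rescaled estimate recovers the true variance.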

In this discussion, we have assumed that both $GG^*$ and $\xi_f$ are unknown. If $\xi_f$ is known, then the
maximum likelihood estimator for $GG^*$ is given by Equation (5.5-7) and this estimate is unbiased. The proof is
left as an exercise. This result gives insight into the reasons for the bias of the estimator given by
Equation (5.5-17). Note that Equations (5.5-17) and (5.5-7) are identical except that the sample mean is used
in Equation (5.5-17) in place of the true mean in Equation (5.5-7). This substitution of the sample mean for
the true mean has resulted in a bias.


The difference between the estimates from Equations (5.5-17) and (5.5-7) can be written in the form

$$\frac{1}{N} \sum_{i=1}^{N} [Z_i - f(\xi_f)][Z_i - f(\xi_f)]^* - \frac{1}{N} \sum_{i=1}^{N} [Z_i - \bar{Z}][Z_i - \bar{Z}]^* = [\bar{Z} - f(\xi_f)][\bar{Z} - f(\xi_f)]^*, \qquad \bar{Z} = \frac{1}{N} \sum_{j=1}^{N} Z_j \tag{5.5-21}$$

As this expression shows, the estimate of $GG^*$ using the sample mean is less than or equal to the estimate
using the true mean for every realization (i.e., the difference is positive semidefinite), equality occurring
only when the sample mean $\bar{Z}$ is equal to $f(\xi_f)$. This is a stronger property than the bias
difference; the bias difference implies only that the expected value using the sample mean is less.
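A small numerical check makes the positive semidefinite difference concrete. The sketch below is an illustration only: the two-channel data `z` and the "true" mean are hypothetical. It forms the second-moment matrix about the true mean and about the sample mean and compares their difference with the outer product of the offset between the two means.

```python
def outer(a, b):
    """Outer product of two short vectors as a nested list."""
    return [[ai * bj for bj in b] for ai in a]

def second_moment(z, center):
    """(1/N) * sum of [Z_i - center][Z_i - center]* for two-channel data."""
    n = len(z)
    total = [[0.0, 0.0], [0.0, 0.0]]
    for zi in z:
        r = [zi[0] - center[0], zi[1] - center[1]]
        o = outer(r, r)
        for a in range(2):
            for b in range(2):
                total[a][b] += o[a][b] / n
    return total

z = [[1.0, 2.0], [2.0, 0.0], [0.0, 1.0], [3.0, 3.0]]    # hypothetical measurements
true_mean = [1.0, 1.0]                                   # hypothetical f(xi_f)
sample_mean = [sum(zi[k] for zi in z) / len(z) for k in range(2)]

with_true = second_moment(z, true_mean)        # form of Equation (5.5-7)
with_sample = second_moment(z, sample_mean)    # form of Equation (5.5-17)
diff = [[with_true[a][b] - with_sample[a][b] for b in range(2)] for a in range(2)]

offset = [sample_mean[k] - true_mean[k] for k in range(2)]
rank_one = outer(offset, offset)               # always positive semidefinite
```

Here the sample mean is (1.5, 1.5), so the difference is the rank-one matrix with every element equal to 0.25.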


5.6  NON-GAUSSIAN  NOISE 

Non-Gaussian noise is so general a classification that little can be said beyond the discussion in
Chapter 4. The forms and properties of the estimators depend strongly on the types of noise distribution.
The same comments apply to Gaussian noise if it is not additive or multiplicative, because the conditional
distribution of $Z$ given $\xi$ is then non-Gaussian. In general, we apply the rules for transformation of
variables to derive the conditional distribution of $Z$ given $\xi$. Using this distribution, and the prior
distribution of $\xi$ if defined, we can derive the various estimators in principle.

The optimal estimators of Chapter 4 often require considerable computation for non-Gaussian noise. It is
often possible to define much simpler estimators which have adequate performance. We will examine one
situation where such simplification can occur.

Let the system model be linear with additive noise

$$Z = C\xi + \omega \tag{5.6-1}$$

The distribution of $\omega$ must have finite mean and variance independent of $\xi$, but is otherwise
unrestricted. Call the mean $m_\omega$ and the variance $GG^*$. We will restrict ourselves to considering only
linear estimators of the form

$$\hat{\xi} = KZ + D \tag{5.6-2}$$

Within this class, we will look for minimum-variance, unbiased estimators. We will require that the variance
be minimized only over the class of unbiased linear estimators; there will be no guarantee that a smaller
variance cannot be attained by a nonlinear estimator.


The bias of an estimator of the form of Equation (5.6-2) is

$$b(\xi) = E\{\hat{\xi} | \xi\} - \xi = KC\xi - \xi + D + Km_\omega \tag{5.6-3}$$

If the estimator is to be unbiased for every $\xi$, we must have

$$D = -Km_\omega \tag{5.6-4a}$$

$$KC = I \tag{5.6-4b}$$

The variance of an unbiased estimator of the given form is

$$\mathrm{var}(\hat{\xi}) = K\,GG^*K^* \tag{5.6-5}$$


Note that the bias and variance of the estimate depend only on the mean and variance of the noise
distribution. The exact noise distribution need not even be known. If the noise distribution were Gaussian, a
minimum-variance unbiased estimator would exist and be given by

$$\hat{\xi} = (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1} (Z - m_\omega) \tag{5.6-6}$$

This estimator is linear. Since no unbiased estimator, linear or not, can have a lower variance for the
Gaussian case, this estimator is the minimum-variance, unbiased linear estimator for Gaussian noise. Since
the bias and variance of a linear estimator depend only on the mean and variance of the noise, this is the
minimum-variance, unbiased linear estimator for any noise distribution with the same mean and variance.

The optimality of this estimator can also be easily proven without reference to Gaussian distributions
(although the above proof is complete and rigorous). Let

$$A = K - (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1} \tag{5.6-7}$$

for any $K$. Then

$$\begin{aligned}
0 \le A\,GG^*A^* &= K\,GG^*K^* + (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}\,GG^*\,(GG^*)^{-1} C\,(C^*(GG^*)^{-1}C)^{-1} \\
&\quad - K\,GG^*(GG^*)^{-1} C\,(C^*(GG^*)^{-1}C)^{-1} - (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1}\,GG^*K^* \\
&= K\,GG^*K^* + (C^*(GG^*)^{-1}C)^{-1} - KC\,(C^*(GG^*)^{-1}C)^{-1} - (C^*(GG^*)^{-1}C)^{-1} C^*K^*
\end{aligned} \tag{5.6-8}$$

Using Equation (5.6-4b) as a constraint on $K$, Equation (5.6-8) becomes

$$0 \le K\,GG^*K^* - (C^*(GG^*)^{-1}C)^{-1} \tag{5.6-9}$$

or, using Equation (5.6-5)

$$\mathrm{var}(\hat{\xi}) \ge (C^*(GG^*)^{-1}C)^{-1} \tag{5.6-10}$$

Thus no $K$ satisfying Equation (5.6-4b) can achieve a variance lower than that given by Equation (5.6-10).
The variance is equal to the minimum if and only if $A$ is zero; that is, if

$$K = (C^*(GG^*)^{-1}C)^{-1} C^*(GG^*)^{-1} \tag{5.6-11}$$

Therefore Equation (5.6-6) defines the unique minimum-variance, unbiased linear estimator. We are assuming
that $GG^*$ and $C^*(GG^*)^{-1}C$ are nonsingular; Section 5.3 discusses the singular cases.
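The bound in Equation (5.6-10) can be illustrated with a short computation. The sketch below is an example under stated assumptions: a scalar parameter observed through a hypothetical 3-by-1 matrix `c`, with a hypothetical diagonal noise covariance `var`. It compares the gain of Equation (5.6-11) with an unweighted least-squares gain, which also satisfies the constraint $KC = I$ but ignores the noise variances.

```python
c = [1.0, 2.0, 1.0]        # hypothetical C (3 x 1)
var = [1.0, 4.0, 0.25]     # hypothetical diagonal of GG*

def variance(k):
    """K GG* K* for a row-vector gain K and diagonal GG*, Equation (5.6-5)."""
    return sum(ki * ki * vi for ki, vi in zip(k, var))

def is_unbiased(k):
    """Check the constraint KC = I of Equation (5.6-4b) (here the scalar 1)."""
    return abs(sum(ki * ci for ki, ci in zip(k, c)) - 1.0) < 1e-12

info = sum(ci * ci / vi for ci, vi in zip(c, var))     # C*(GG*)^-1 C
k_opt = [ci / vi / info for ci, vi in zip(c, var)]     # Equation (5.6-11)

norm = sum(ci * ci for ci in c)
k_ols = [ci / norm for ci in c]                        # unweighted alternative

bound = 1.0 / info                                     # right side of (5.6-10)
```

Both gains are unbiased, but only `k_opt` attains the bound of 1/6; the unweighted gain has a variance of about 0.48 because it gives too much weight to the noisy second measurement.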


In summary, if the system is linear with additive noise, and the estimator is required to be linear and
unbiased, the results for Gaussian distributions apply to any distribution with the same mean and variance.

The use of optimal nonlinear estimators is seldom justifiable in view of the current state of the art.
Although exceptional cases exist, three factors argue against using optimal nonlinear estimators. The first
factor is the complexity and corresponding cost of deriving and implementing optimal nonlinear estimators.
For some problems, we can construct fairly simple suboptimal nonlinear estimators that give better performance
than the linear estimators (often by slightly modifying the linear estimator), but optimal nonlinear
estimation is a difficult task.

The second factor is that linear estimators, perhaps slightly modified, often can give quite good
estimates, even if they are not exactly optimal. Based on the central limit theorem, several results show that,
under fairly general conditions, the linear estimates will approach the optimal nonlinear estimates as the
number of samples increases. The precise conditions and proofs of these results are beyond the scope of this
book.


The third factor is that we seldom have precise knowledge of the distribution anyway. The errors from
inaccurate specification of the distribution are likely to be as large as the errors from using a suboptimal
linear estimator. We need to consider this fact in deciding whether an optimal nonlinear estimator is really
worth the cost. From Gauss (1809, p. 253):

The investigation of an orbit having, strictly speaking, the maximum probability,
will depend upon a knowledge of...[the probability distribution]; but
that depends upon so many vague and doubtful considerations - physiological
included - which cannot be subjected to calculation, that it is scarcely, and
indeed less than scarcely, possible....


CHAPTER 6


6.0  STOCHASTIC  PROCESSES 

In simplest terms, a stochastic process is a random variable that is a function of time. Thus stochastic
processes are basic to the study of parameter estimation for dynamic systems. A complete and rigorous study
of stochastic process theory requires considerable depth of mathematical background, particularly for
continuous-time processes. For the purposes of this book, such depth of background is not required. Our
approach does not draw heavily on stochastic process theory.

This chapter focuses on the few results that are needed for this document. Astrom (1970), Papoulis
(1965), Lipster and Shiryayev (1977), and numerous other books give more complete treatments at varying levels
of abstraction. The necessary results in this chapter are largely concerned with continuous-time models.
Although we derive a few discrete-time equations in order to examine their continuous-time limits, the chapter
can be omitted if you are studying only discrete-time analysis.


6.1  DISCRETE  TIME 

A discrete-time random process $x$ is simply a collection of random variables $x_i$, one for each time
point, defined on the same probability space. There can be a finite or infinite number of time points. The
stochastic process is completely characterized by the joint distributions of all of the $x_i$. This can be a
rather unwieldy means of characterizing the process, however, particularly if the number of time points is
infinite.

If the $x_i$ are jointly Gaussian, the process can be characterized by its first and second moments.
Non-Gaussian processes are often also analyzed in terms of their first two moments because exact analyses are
too complicated. The first two moments of the process $x$ are

$$m(i) = E\{x_i\} \tag{6.1-1}$$

$$R(i,j) = E\{x_i x_j^*\} \tag{6.1-2}$$

The function $R(i,j)$ is called the autocorrelation function of the process.

A process is called stationary if the joint distribution of any collection of the $x_i$ depends only on
differences of the $i$ values, not on the absolute time. This is called strict-sense stationarity. A process
is stationary to second order or wide-sense stationary if the first moment is constant and the second moments
depend only on time differences; i.e., if

$$R(i - k, j - k) = R(i,j) \tag{6.1-3}$$

for all $i$, $j$, and $k$. For Gaussian processes wide-sense stationarity implies strict-sense stationarity. The
autocorrelation function of a wide-sense stationary process can be written as a function of one variable, the
time difference.

$$R(k) = R(i, i + k) \tag{6.1-4}$$

A process is called white if $x_i$ is independent of $x_j$ for all $i \ne j$. Thus a zero-mean Gaussian process
is white if $R(i,j) = 0$ when $i \ne j$. Any process that is not white is called colored. A white process can be
characterized by the distribution of $x_i$ for each $i$. If a process is both white and stationary, the
distribution of $x_i$ is the same as that of $x_j$ for all $i$ and $j$, and this distribution is sufficient to
characterize the process.

6.1.1  Linear  Systems  Forced  by  Gaussian  White  Noise 

Our primary interest in this chapter is in the results of passing random signals through dynamic systems.
We will first look at the simplest case, stationary white Gaussian noise passing through a linear system. The
system equation is

$$x_{i+1} = \Phi x_i + Fn_i \qquad i = 0, 1, \ldots \tag{6.1-5}$$

where $n$ is a stationary, Gaussian, white process with zero mean and identity covariance. The assumption of
zero mean is made solely to simplify the equations. Results for nonzero mean can be obtained by linear
superposition of the deterministic response to the mean and the stochastic response to the process with the mean
removed. We are also given that $x_0$ is Gaussian with mean 0 and covariance $P_0$, and that $x_0$ is independent
of the $n_i$.

The $x_i$ form a stochastic process generated from the $n_i$. We desire to examine the properties of the
stochastic process $x$. It is immediately obvious that $x$ is Gaussian because $x_i$ can be written as a linear
combination of $x_0$ and $n_0, n_1, \ldots, n_{i-1}$. In fact, the joint distribution of the $x_i$ can be easily
derived by explicitly writing this linear relation and using Theorem (3.5-5). We will leave this derivation as
an exercise, and pursue instead a derivation using recursion along the lines that will be used in Chapter 7.

Assume we know that $x_i$ has mean 0 and covariance $P_i$. Then the distribution of $x_{i+1}$ follows
immediately from Equation (6.1-5):

$$E\{x_{i+1}\} = \Phi E\{x_i\} + FE\{n_i\} = 0 \tag{6.1-6}$$

$$E\{x_{i+1} x_{i+1}^*\} = \Phi E\{x_i x_i^*\}\Phi^* + FE\{n_i n_i^*\}F^* + \Phi E\{x_i n_i^*\}F^* + FE\{n_i x_i^*\}\Phi^* = \Phi P_i \Phi^* + FF^* \tag{6.1-7}$$

The cross terms in Equation (6.1-7) drop out because $x_i$ is a function only of $x_0$ and
$n_0, n_1, \ldots, n_{i-1}$, all of which are independent of $n_i$ by assumption. We now have a recursive formula
for the covariance of $x_i$:

$$P_{i+1} = \Phi P_i \Phi^* + FF^* \qquad i = 0, 1, \ldots \tag{6.1-8}$$

where $P_0$ is a given point from which we can start the recursion.
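The recursion (6.1-8) is straightforward to iterate. The sketch below is a minimal scalar illustration; the values of `phi` and `f` are hypothetical and chosen so that the system is stable. It also compares the result after many steps with the steady-state value $F^2/(1 - \Phi^2)$ obtained by setting $P_{i+1} = P_i$ in the scalar form of Equation (6.1-8).

```python
def propagate(p0, phi, f, steps):
    """Iterate Equation (6.1-8) for a scalar state: P[i+1] = phi * P[i] * phi + f * f."""
    p = [p0]
    for _ in range(steps):
        p.append(phi * p[-1] * phi + f * f)
    return p

phi, f = 0.9, 1.0                       # hypothetical stable system
p = propagate(0.0, phi, f, 200)         # start from a known initial state, P_0 = 0
p_steady = f * f / (1.0 - phi * phi)    # fixed point of the scalar recursion
```

The covariance grows from $P_0 = 0$ toward the steady-state value, so the process is not stationary during the transient even though the forcing noise is.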

We know that the $x_i$ are jointly Gaussian zero-mean variables with covariances given by the recursion
(6.1-8). To complete the characterization of the joint distribution of the $x_i$, we need only the
cross-covariances $E\{x_i x_j^*\}$ for $i \ne j$. Assume without loss of generality that $i > j$. Then $x_i$ can
be written as

$$x_i = \Phi^{i-j} x_j + \sum_{k=j}^{i-1} \Phi^{i-1-k} F n_k \tag{6.1-9}$$

so that

$$E\{x_i x_j^*\} = \Phi^{i-j} E\{x_j x_j^*\} + \sum_{k=j}^{i-1} \Phi^{i-1-k} F E\{n_k x_j^*\} = \Phi^{i-j} P_j \qquad i \ge j \tag{6.1-10}$$

The cross terms in Equation (6.1-10) are all zero by the same reasoning as used for Equation (6.1-7). For
$i < j$, the same derivation (or transposition of the above result) gives

$$E\{x_i x_j^*\} = P_i (\Phi^*)^{j-i} \qquad i < j \tag{6.1-11}$$

This completes the derivation of the joint distribution of the $x_i$. Note that $x$ is neither stationary nor
white (except in special cases).
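Equation (6.1-10) can be checked against the explicit linear relation mentioned earlier, in which each $x_i$ is written as a weighted sum of $x_0$ and $n_0, \ldots, n_{i-1}$. The sketch below does this for a scalar system; the values of `phi`, `f`, `p0`, and the indices are hypothetical.

```python
def state_weights(i, phi, f):
    """Coefficients of x_i on (x_0, n_0, ..., n_{i-1}) from repeated use of (6.1-5)."""
    return [phi ** i] + [phi ** (i - 1 - k) * f for k in range(i)]

def cross_covariance(i, j, phi, f, p0):
    """E{x_i x_j} computed directly from the independent inputs x_0 and n_k."""
    wi, wj = state_weights(i, phi, f), state_weights(j, phi, f)
    input_var = [p0] + [1.0] * max(i, j)      # var(x_0) = p0, var(n_k) = 1
    # zip stops at the shorter weight list: the earlier state does not
    # depend on the later noise samples.
    return sum(a * b * v for a, b, v in zip(wi, wj, input_var))

phi, f, p0 = 0.8, 0.5, 2.0
i, j = 7, 4
p_j = cross_covariance(j, j, phi, f, p0)       # covariance P_j
direct = cross_covariance(i, j, phi, f, p0)    # E{x_i x_j} from the linear relation
formula = phi ** (i - j) * p_j                 # Equation (6.1-10)
```

The two values agree to rounding error, and because `direct` depends on both indices rather than only on their difference, the example also shows why $x$ is not stationary in general.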

6.1.2 Nonlinear Systems and Non-Gaussian Noise

If the noise is not Gaussian, analyzing the system becomes much more difficult. Except in special cases,
we then have to work with the probability distributions as functions instead of simply using the means and
covariances. Similar problems arise for nonlinear systems or nonadditive noise even if the noise is Gaussian,
because the distributions of the $x_i$ will not then be Gaussian.

Consider the system

$$x_{i+1} = f(x_i, n_i) \qquad i = 0, 1, \ldots \tag{6.1-12}$$

Assume that $f$ has continuous partial derivatives almost everywhere, and can be inverted to obtain $n_i$
(trivial if the noise is additive):

$$n_i = f^{-1}(x_i, x_{i+1}) \tag{6.1-13}$$

The $n_i$ are assumed to be white and independent of $x_0$, but not necessarily Gaussian. Then the conditional
distribution of $x_{i+1}$ given $x_i$ can be obtained from Equation (3.4-1)

$$p_{x_{i+1}|x_i}(x_{i+1}|x_i) = p_{n_i}(f^{-1}(x_i, x_{i+1}))\,|\det(J)| \tag{6.1-14}$$

where $J$ is the Jacobian of the transformation $f^{-1}$. The joint distribution of $x_0, \ldots, x_N$ can then be
obtained from

$$p_x(x_0, \ldots, x_N) = p_{x_0}(x_0) \prod_{i=1}^{N} p_{x_i|x_{i-1}}(x_i|x_{i-1}) \tag{6.1-15}$$

Equations (6.1-14) and (6.1-15) are, in general, too unwieldy to work with in practice. Practical work with
nonlinear systems or non-Gaussian noise usually involves simplifying approximations.


6.2  CONTINUOUS  TIME 

We will look at continuous-time stochastic processes by looking at limits of discrete-time processes with
the time interval going to 0. The discussion will focus on how to take the limit so that a useful result is
obtained. We will not get involved in the intricacies of Ito or Stratonovich calculus (Astrom, 1970;
Jazwinski, 1970; and Lipster and Shiryayev, 1977).

6.2.1  Linear  Systems  Forced  by  White  Noise 

Consider a linear continuous-time dynamic system driven by white, zero-mean noise

$$\dot{x}(t) = Ax(t) + F_c n(t) \tag{6.2-1}$$

We would like to look at this system as a limit (in some sense) of the discrete-time systems

$$x(t_i + \Delta) = (I + \Delta A)x(t_i) + \Delta F_c n(t_i) \tag{6.2-2}$$

as $\Delta$, the time interval between samples, goes to zero. Equation (6.2-2) is in the form of Euler's method
for approximating the solution of Equation (6.2-1). For the moment we will consider the discrete $n(t_i)$ to be
Gaussian. The distribution of the $n(t_i)$ is not particularly important to the end result, but our argument is
somewhat easier if the $n(t_i)$ are Gaussian. Equation (6.2-2) corresponds to Equation (6.1-5) with $I + \Delta A$
substituted for $\Phi$, $\Delta F_c$ substituted for $F$, and some changes in notation to make the discrete and
continuous notations more similar.

If $n$ were a reasonably behaved deterministic process, we would get Equation (6.2-1) as a limit of
Equation (6.2-2) when $\Delta$ goes to zero. For the stochastic system, however, the situation is quite different.

Substituting $I + \Delta A$ for $\Phi$ and $\Delta F_c$ for $F$ in Equation (6.1-8) gives

$$P(t_i + \Delta) = (I + \Delta A)P(t_i)(I + \Delta A)^* + \Delta^2 F_c F_c^* \tag{6.2-3}$$

Subtracting $P(t_i)$ and dividing by $\Delta$ gives

$$\frac{P(t_i + \Delta) - P(t_i)}{\Delta} = AP(t_i) + P(t_i)A^* + \Delta AP(t_i)A^* + \Delta F_c F_c^* \tag{6.2-4}$$

Thus in the limit

$$\dot{P}(t) = AP(t) + P(t)A^* \tag{6.2-5}$$

Note that $F_c$ has completely dropped out of Equation (6.2-5). The distribution of $x$ does not depend on the
distribution of the forcing noise. In particular, if $P_0 = 0$, then $P(t) = 0$ for all $t$. The system simply
does not respond to the forcing noise.

A model in which the system does not respond to the noise is not very useful. A useful model would be one
that gives a finite nonzero covariance. Such a model is achieved by multiplying the noise by $\Delta^{-1/2}$ (and
thus its covariance by $\Delta^{-1}$). We rewrite Equation (6.2-2) as

$$x(t_i + \Delta) = (I + \Delta A)x(t_i) + \Delta^{1/2} F_c n(t_i) \tag{6.2-6}$$

The $\Delta$ in the $\Delta F_c F_c^*$ term of Equation (6.2-4) then disappears and the limit becomes

$$\dot{P}(t) = AP(t) + P(t)A^* + F_c F_c^* \tag{6.2-7}$$

Note that only a $\Delta^{-1}$ behavior of the covariance (or something asymptotic to $\Delta^{-1}$) will give a
finite nonzero result in the limit.
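The effect of the two noise scalings can be seen numerically. The sketch below is a scalar illustration with hypothetical values of `a`, `fc`, and the final time; it propagates the discrete covariance with the noise gain of Equation (6.2-2) (`noise_power = 1.0`) and of Equation (6.2-6) (`noise_power = 0.5`) for successively smaller time steps, and compares the results with the closed-form solution of the scalar version of Equation (6.2-7) starting from $P_0 = 0$.

```python
import math

def euler_covariance(a, fc, t_end, steps, noise_power):
    """Scalar covariance recursion with noise gain delta**noise_power * fc and P(0) = 0."""
    delta = t_end / steps
    p = 0.0
    for _ in range(steps):
        p = (1.0 + delta * a) ** 2 * p + delta ** (2.0 * noise_power) * fc * fc
    return p

a, fc, t_end = -1.0, 1.0, 1.0            # hypothetical stable scalar system
# Closed-form solution of dP/dt = 2 a P + fc^2 with P(0) = 0
exact = fc * fc / (-2.0 * a) * (1.0 - math.exp(2.0 * a * t_end))

step_counts = (10, 100, 1000)
naive = [euler_covariance(a, fc, t_end, n, 1.0) for n in step_counts]    # Equation (6.2-2)
scaled = [euler_covariance(a, fc, t_end, n, 0.5) for n in step_counts]   # Equation (6.2-6)
```

As the step size shrinks, the `naive` values shrink toward zero in proportion to the step size, while the `scaled` values approach the finite nonzero value `exact`.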

We will thus define the continuous-time white-noise process in Equation (6.2-1) as a limit, in some
sense, of discrete-time processes with covariances $\Delta^{-1}$. The autocorrelation function of the
continuous-time process is

$$R(t,\tau) = E\{n(t)n(\tau)^*\} = \delta(t - \tau) \tag{6.2-8}$$

The impulse function $\delta(s)$ is zero for $s \ne 0$ and infinite for $s = 0$, and its integral over any finite
range including the origin is 1. We will not go through the mathematical formalism required to rigorously define
the impulse function; suffice it to say that the concept can be defined rigorously.

This model for a continuous-time white-noise process requires further discussion. It is obviously not a
faithful representation of any physical process because the variance of $n(t)$ is infinite at every time point.
The total power of the process is also infinite. The response of a dynamic system to this process, however,
appears well-behaved.

The reasons for this apparently anomalous behavior are most easily understood in the frequency domain.
The power spectrum of the process $n$ is flat; there is the same power in every frequency band of the same
width. There is finite power in any finite frequency range, but because the process has infinite bandwidth,
the total power is infinite. Because any physical system has finite bandwidth, the system response to the
noise will be finite. If, on the other hand, we kept the total power of the noise finite as we originally
tried to do, the power in any finite frequency band would go to zero as we approached infinite bandwidth; thus,
a physical system would have zero response.

The preceding paragraph explains why it is necessary to have infinite power in a meaningful
continuous-time white-noise process. It also suggests a rationale for justifying such a model even though any
physical noise source must have finite power. We can envision the physical noise as being band limited, but
with a band limit much larger than the system band limit. If the noise band limit is large enough, its exact
value is unimportant because the system response to inputs at a very high frequency is negligible. Therefore, we
can analyze the system with white noise of infinite bandwidth and obtain results that are very good
approximations to the finite-bandwidth results. The analysis is much simpler in the infinite-bandwidth
white-noise model (even though some fairly abstract mathematics is required to make it rigorous). In summary,
continuous-time white noise is not physically realizable but can give results that are good approximations to
physical systems.

6.2.2 Additive White Measurement Noise


We saw in the previous section that continuous-time white noise driving a dynamic system must have
infinite power in order to obtain useful results. We will show in this section that the same conclusion
applies to continuous-time white measurement noise.

We suppose that noise-corrupted measurements $z$ are made of the system of Equation (6.2-1). The
measurement equation is assumed to be linear with additive white noise:

$$z(t) = Cx(t) + G_c n(t) \tag{6.2-9}$$

For convenience, we will assume that the mean of the noise is 0. We then ask what else must be said about
$n(t)$ in order to obtain useful results from this model.

Presume that we have measured $z(t)$ over the interval $0 \le t \le T$, and we want to estimate some
characteristic of the system, say $x(T)$. This is a filtering problem, which we will discuss further in
Chapter 7. For current purposes, we will simplify the problem by assuming that $A = 0$ and $F_c = 0$ in
Equation (6.2-1). Thus $x(t)$ is a constant over the interval, and dynamics do not enter the problem. We can
consider this a static problem with repeated observations of a random variable, like those situations we
covered in Chapter 5.

Let us look at the limit of the discrete-time equivalents to this problem. If samples are taken every
$\Delta$ seconds, there are $\Delta^{-1}T$ total samples. Equation (5.1-31) is the MAP estimator for the discrete-time
problem. The mean square error of the estimate is given by Equations (5.1-32) to (5.1-34). As $\Delta$ decreases
to 0 and the number of samples increases to infinity, the mean square error decreases to 0. This result would
imply that continuous-time estimates are always exact; it is thus not a very useful model. To get a useful
model, we must let the covariance of the measurement noise go to infinity like $\Delta^{-1}$ as $\Delta$ decreases to 0.
This argument is very similar to that used in the previous section. If the measurement noise had finite
variance, each measurement would give us a finite amount of information, and we would have an infinite amount
of information (no uncertainty) when the number of measurements was infinite. Thus the discrete-time equivalent
of Equation (6.2-9) is

$$ z(t_i) = C x(t_i) + \Delta^{-1/2} G_c\, n(t_i) $$    (6.2-10)

where $n(t_i)$ has identity covariance.
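
The scaling argument can be checked numerically. The following is a minimal sketch, not part of the original development: it assumes a scalar constant x with a Gaussian prior, and uses the fact that the information (inverse variance) from independent Gaussian samples adds. The function name and all numerical values are illustrative assumptions.

```python
def posterior_variance(delta, T, C, prior_var, noise_var):
    """Posterior variance of a constant scalar x after T/delta samples.

    Each sample is z = C*x + noise with variance noise_var. For independent
    Gaussian samples the information (inverse variance) simply adds.
    """
    n_samples = int(round(T / delta))
    information = 1.0 / prior_var + n_samples * C**2 / noise_var
    return 1.0 / information


T, C, prior_var, Gc = 10.0, 1.0, 4.0, 0.5   # illustrative values only
deltas = [1.0, 0.1, 0.01, 0.001]

# Finite noise variance Gc^2, independent of delta: the error collapses
# toward 0 as the sampling interval shrinks.
fixed = [posterior_variance(d, T, C, prior_var, Gc**2) for d in deltas]

# Noise variance Gc^2 / delta, as in Equation (6.2-10): the error does not
# depend on the sampling interval.
scaled = [posterior_variance(d, T, C, prior_var, Gc**2 / d) for d in deltas]

for d, f, s in zip(deltas, fixed, scaled):
    print(f"delta={d:6.3f}  fixed: {f:.6f}  scaled: {s:.6f}")
```

With the fixed variance, the computed error shrinks with every refinement of the sampling interval; with the $\Delta^{-1}$ scaling it stays put, which is the behavior we want from a continuous-time limit.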

Because any measurement is made using a physical device with a finite bandwidth, we stop getting much new
information as we take samples faster than the response time of the instrument. In fact, the measurement equation
is sometimes written as a differential equation for the instrument response instead of in the more idealized
form of Equation (6.2-9). We need a noise model with a finite power in the bandwidth of the measurements
because this is the frequency range that we are really working in. This argument is essentially the same as
the one we used in the discussion of white noise forcing the system. The white noise can again be viewed as
an approximation to band-limited noise with a large bandwidth. The lack of fidelity in representing very
high-frequency characteristics is not too important, because high frequencies will tend to be filtered out when we
operate on the data. (For instance, most operations on continuous-time data will have integrations at some
point.) As a consequence of this modeling, we should be dubious of the practical application of any algorithm
which results from this analysis and does not filter out high-frequency data in some manner.

We can generalize the conclusions in this and the previous section. Continuous-time white noise with
finite variance is generally not a useful concept in any context. We will therefore take as part of the definition
of continuous-time white noise that it have infinite covariance. We will use the spectral density
rather than the covariance as a meaningful measure of the noise amplitude. White noise with autocorrelation

$$ R(t,\tau) = G_c G_c^*\, \delta(t - \tau) $$    (6.2-11)

has spectral density $G_c G_c^*$.

6.2.3  Nonlinear  Systems 


As with discrete-time nonlinearities, exact analysis of nonlinear continuous-time systems is generally so
difficult as to be impossible for most practical intents and purposes. The usual approach is to use a linearization
of the system or some other approximation.


Let the system equation be

$$ \dot{x}(t) = f(x,t) + g(x,t)\, n(t) $$    (6.2-12)

where n is zero-mean white noise with unity power spectral density. For compactness of notation, let p
represent the distribution of x at time t, given that x was $x_0$ at time $t_0$. The evolution of this
distribution is described by the following parabolic partial differential equation:


$$ \frac{\partial p}{\partial t} = -\sum_{i=1}^{n} \frac{\partial}{\partial x_i}\,(p f_i) + \frac{1}{2} \sum_{i,j,k=1}^{n} \frac{\partial^2}{\partial x_i\, \partial x_j}\,(p\, g_{ik}\, g_{jk}) $$    (6.2-13)

where n is the length of the x vector. The initial condition for this equation at $t = t_0$ is
$p = \delta(x - x_0)$. See Jazwinski (1970) for the derivation of Equation (6.2-13). This equation is called the
Fokker-Planck equation or the forward Kolmogorov equation. It is considered one of the basic equations of
nonlinear filtering theory. In principle, this equation completely describes the behavior of the system and
thus the problem is "solved." In practice, the solution of this multidimensional partial differential equation
is usually too formidable to consider seriously.
CHAPTER 7


7.0  STATE  ESTIMATION  FOR  DYNAMIC  SYSTEMS 

In this chapter, we address the estimation of the state of dynamic systems. The emphasis is on linear
dynamic  systems  with  additive  Gaussian  noise.  We  will  initially  develop  the  theory  for  discrete-time  systems 
and  then  extend  it  to  continuous-time  and  mixed  continuous/discrete  models. 

The general form of a linear discrete-time system model is

$$ x_{i+1} = \Phi x_i + \Psi u_i + F n_i \qquad i = 0, 1, \ldots $$    (7.0-1a)

$$ z_i = C x_i + D u_i + G \eta_i \qquad i = 1, 2, \ldots $$    (7.0-1b)

The $n_i$ and $\eta_i$ are assumed to be independent Gaussian noise vectors with zero mean and identity covariance.
The noise n is called process noise or state noise; $\eta$ is called measurement noise. The input vectors, $u_i$,
are assumed to be known exactly. The state of the system at the ith time point is $x_i$. The initial condition
$x_0$ is a Gaussian random variable with mean $m_0$ and covariance $P_0$. ($P_0$ can be zero, meaning that the
initial condition is known exactly.)

In general, the system matrices $\Phi$, $\Psi$, F, C, D, and G can be functions of time. This chapter will assume
that the system is time-invariant in order to simplify the notation. Except for the discussion of steady-state
forms in Section 7.3, the results are easily generalized to time-varying systems by adding appropriate time
subscripts to the matrices.


The state estimation problem is defined as follows: based on the measurements $z_1, z_2, \ldots, z_N$, estimate the
state $x_M$. To shorten the notation, we define

$$ Z_N = (z_1, z_2, \ldots, z_N)^* $$    (7.0-2)


State  estimation  problems  are  commonly  divided  into  three  classes,  depending  on  the  relationship  of  M  and  N. 

If  M  is  equal  to  N,  the  problem  is  called  a  filtering  problem.  Based  on  all  of  the  measurements  taken 
up  to  the  current  time,  we  desire  to  estimate  the  current  state.  This  type  of  problem  is  typical  of  those 
encountered  in  real-time  applications.  It  is  the  most  widely  treated  one,  and  the  one  on  which  we  will 
concentrate. 

If M is greater than N, we have a prediction problem. The data are available up to the current time
N,  and  we  desire  to  predict  the  state  at  some  future  time  M.  We  will  see  that  once  the  filtering  problem  is 
solved,  the  prediction  problem  is  trivial. 

If M is less than N, the problem is called a smoothing problem. This type of problem is most commonly
encountered in postexperiment batch processing in which all of the data are gathered before processing begins.
In this case, the estimate of $x_M$ can be based on all of the data gathered, both before and after time M.

By using all values of M from 1 to N - 1, plus the filtered solution for M = N, we can construct the estimated
state time history for the interval being processed. This is referred to as fixed-interval smoothing.
Smoothing can also be used in a real-time environment where a delay of a few time points in obtaining current
state estimates is an acceptable price for the improved accuracy gained. For instance, it might be acceptable
to gather data up to time N = M + 2 before computing the estimate of $x_M$. This is called fixed-lag smoothing.
A third type of smoothing is fixed-point smoothing; in this case, it is desired to estimate $x_M$ for a
particular fixed M in a real-time environment, using new data to improve the estimate.

In all cases, $x_M$ will have a prior distribution derived from Equation (7.0-1a) and the noise distributions.
Since Equation (7.0-1) is linear in the noise, and the noise is assumed Gaussian, the prior and
posterior distributions of $x_M$ will be Gaussian. Therefore, the a posteriori expected value, MAP, and many
Bayes' minimum risk estimators will be identical. These are the obvious estimators for a problem with a
well-defined prior distribution. The remainder of the chapter assumes the use of these estimators.


7.1  EXPLICIT  FORMULATION 

By  manipulating  Equation  (7.0-1)  into  an  appropriate  form,  we  can  write  the  state  estimation  problem  as  a 
special  case  of  the  static  estimation  problem  studied  in  Chapter  5.  In  this  section,  we  will  solve  the  problem 
by  such  manipulation;  the  fact  that  a  dynamic  system  is  involved  will  thus  play  no  special  role  in  the  meaning 
of  the  estimation  problem.  We  will  examine  only  the  filtering  problem  here. 

Our aim is to manipulate the state estimation problem into the form of Equation (5.1-1). The most obvious
approach to this problem is to define the $\xi$ of Equation (5.1-1) to be $x_N$, the vector which we desire to
estimate. The observation, Z, would be a concatenation of $z_1, \ldots, z_N$; and the input, U, would be a concatenation
of $u_0, \ldots, u_N$. The noise vector, $\omega$, would then have to be a concatenation of $n_0, \ldots, n_{N-1}, \eta_1, \ldots, \eta_N$.
The problem can indeed be written in this manner. Unfortunately, the prior distribution of $x_N$ is not independent
of $n_0, \ldots, n_{N-1}$ (except for the case N = 0); therefore, Equation (5.1-16) is not the correct expression
for the MAP estimate of $x_N$. Of course, we could derive an appropriate expression allowing for the correlation,
but we will take an alternate approach which allows the direct use of Equation (5.1-16).

Let the unknown parameter vector be the concatenation of the initial condition and all of the process
noise vectors.

$$ \xi = [x_0, n_0, n_1, \ldots, n_{N-1}]^* $$    (7.1-1)

The vector $x_N$, which we really desire to estimate, can be written as an explicit function of the elements of
$\xi$. In particular, Equation (7.0-1a) expands into

$$ x_N = \Phi^N x_0 + \sum_{i=0}^{N-1} \Phi^{N-i-1} (\Psi u_i + F n_i) $$    (7.1-2)

We can compute the MAP estimate of $x_N$ by using the MAP estimates of $x_0$ and $n_i$ in Equation (7.1-2). Note
that we can freely treat the $n_i$ as noise or as unknown parameters with prior distributions without changing
the essential nature of the problem. The probability distribution of Z is identical in either case. The
only distinction is whether or not we want estimates of the $n_i$. For this choice of $\xi$, the remaining items
of Equation (5.1-1) must be

$$ Z = [z_1, z_2, \ldots, z_N]^* $$

$$ U = [u_0, u_1, \ldots, u_N]^* $$    (7.1-3)

$$ \omega = [\eta_1, \eta_2, \ldots, \eta_N]^* $$

We get an explicit formula for $z_i$ by substituting Equation (7.1-2) into Equation (7.0-1b), giving

$$ z_i = C\Phi^i x_0 + C \sum_{j=0}^{i-1} \Phi^{i-1-j} (\Psi u_j + F n_j) + D u_i + G \eta_i $$    (7.1-4)

which can be written in the form of Equation (5.1-1) with


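$$ C(U) = \begin{bmatrix} C\Phi & CF & 0 & \cdots & 0 \\ C\Phi^2 & C\Phi F & CF & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ C\Phi^N & C\Phi^{N-1}F & C\Phi^{N-2}F & \cdots & CF \end{bmatrix} $$    (7.1-5a)

$$ D(U) = \begin{bmatrix} C\Psi & D & 0 & \cdots & 0 & 0 \\ C\Phi\Psi & C\Psi & D & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ C\Phi^{N-1}\Psi & C\Phi^{N-2}\Psi & C\Phi^{N-3}\Psi & \cdots & C\Psi & D \end{bmatrix} U $$    (7.1-5b)

$$ G(U) = \begin{bmatrix} G & 0 & \cdots & 0 \\ 0 & G & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & G \end{bmatrix} $$    (7.1-5c)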


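You can easily verify these matrices by substituting them into Equation (5.1-1). The mean and covariance of
the prior distribution of $\xi$ are

$$ m_\xi = \begin{bmatrix} m_0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad P = \begin{bmatrix} P_0 & 0 & 0 & \cdots & 0 \\ 0 & I & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & I \end{bmatrix} $$    (7.1-6)

The MAP estimate of $\xi$ is then given by Equation (5.1-16). The MAP estimate of $x_N$, which we seek, is
obtained from that of $\xi$ by using Equation (7.1-2).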
The filtering problem is thus "solved." This solution, however, is unacceptably cumbersome. If the
system state is an $\ell$-vector, the inversion of an $(N + 1)\ell$-by-$(N + 1)\ell$ matrix is required in order to estimate
$x_N$. The computational costs become unacceptable after a very few time points. We could investigate whether it
is possible to take advantage of the structure of the matrices given in Equation (7.1-5) in order to simplify
the computation. We can more readily achieve the same ends, however, by adopting a different approach to
solving the problem from the start.


7.2  RECURSIVE  FORMULATION 


To find a simpler solution to the filtering problem than that derived in the preceding section, we need to
take better advantage of the special structure of the problem. The above derivation used the linearity of the
problem and the Gaussian assumption on the noise, which are secondary features of the problem structure. The
fact that the problem involves a dynamic state-space model is much more basic, but was not used above to any
special advantage; the first step in the derivation was to recast the system in the form of a static model.

Let us reexamine the problem, making use of the properties of dynamic state-space systems.


The defining property of a state-space model is as follows: the future output is dependent only on the
current state and the future input. In other words, provided that the current state of the system is known,
knowledge of any previous states, inputs, or outputs is irrelevant to the prediction of future system behavior;
all relevant facts about previous behavior are subsumed in the knowledge of the current state. This is
essentially the definition of the state of a system. The probabilistic expression of this idea is


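$$ p(z_N, z_{N+1}, \ldots \mid x_N, u_N, u_{N+1}, \ldots) = p(z_N, z_{N+1}, \ldots \mid x_N, x_{N-1}, \ldots, u_N, u_{N+1}, \ldots, u_{N-1}, u_{N-2}, \ldots, z_{N-1}, z_{N-2}, \ldots) $$    (7.2-1)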


It is this property that allows the system to be described in a recursive form, such as that of
Equation (7.0-1). The recursive form involves much less computation than the mathematically equivalent explicit
form of Equation (7.1-4).

This reasoning suggests that recursion might be used to some advantage in obtaining a solution to the
filtering problem. The estimators under consideration (MAP, etc.) are all defined from the conditional
distribution of $x_N$ given $Z_N$. We will seek a recursive expression for the conditional distribution, and thus
for the estimates. We will prove that such an expression exists by deriving it.

In the nature of recursive forms, we start by assuming that the conditional distribution of $x_N$ given $Z_N$
is known for some N, and then we attempt to derive an expression for the conditional distribution of
$x_{N+1}$ given $Z_{N+1}$. We recognize this task as similar to the measurement partitioning of Section 5.2.2, in that
we want to simplify the solution by processing the measurements one at a time. Equations (5.2-2) and (7.2-1)
express similar ideas and give the basis for the simplifications in both cases. (The $x_N$ of Equation (7.2-1)
corresponds to the $\xi$ of Equation (5.2-2).)

Our task then is to derive $p(x_{N+1}|Z_{N+1})$. We will divide this task into two steps. First, derive
$p(x_{N+1}|Z_N)$ from $p(x_N|Z_N)$. This is called the prediction step, because we are predicting $x_{N+1}$ based on
previous information. It is also called the time update because we are updating the estimate to a new time point
based on the same data. The second step is to derive $p(x_{N+1}|Z_{N+1})$ from $p(x_{N+1}|Z_N)$. This is called the
correction step, because we are correcting the predicted estimate of $x_{N+1}$ based on the new information in
$z_{N+1}$. It is also called the measurement update because we are updating the estimate based on the new
measurement.

Since all of the distributions are assumed to be Gaussian, they are completely defined by their means and
covariance matrices. Denote the (presumed known) mean and covariance of the distribution $p(x_N|Z_N)$ by $\hat{x}_N$ and
$P_N$, respectively. In general, $\hat{x}_N$ and $P_N$ are functions of $Z_N$, but we will not encumber the notation with this
information. Likewise, denote the mean and covariance of $p(x_{N+1}|Z_N)$ by $\tilde{x}_{N+1}$ and $Q_{N+1}$. The task is thus to
derive expressions for $\tilde{x}_{N+1}$ and $Q_{N+1}$ in terms of $\hat{x}_N$ and $P_N$, and expressions for $\hat{x}_{N+1}$ and $P_{N+1}$ in terms of
$\tilde{x}_{N+1}$ and $Q_{N+1}$.

7.2.1  Prediction  Step 

The prediction step (time update) is straightforward. For $\tilde{x}_{N+1}$, simply take the expected value of
Equation (7.0-1a) conditioned on $Z_N$.

$$ E\{x_{N+1}|Z_N\} = \Phi\, E\{x_N|Z_N\} + \Psi u_N + F\, E\{n_N|Z_N\} $$    (7.2-2)

The quantities $E\{x_{N+1}|Z_N\}$ and $E\{x_N|Z_N\}$ are, by definition, $\tilde{x}_{N+1}$ (the predicted mean) and $\hat{x}_N$ (the
filtered mean), respectively. $Z_N$ is a function of $x_0$, $n_0, \ldots, n_{N-1}$, $\eta_1, \ldots, \eta_N$, and deterministic
quantities; $n_N$ is independent of all of these, and therefore independent of $Z_N$. Thus

$$ E\{n_N|Z_N\} = E\{n_N\} = 0 $$    (7.2-3)

Substituting this into Equation (7.2-2) gives

$$ \tilde{x}_{N+1} = \Phi \hat{x}_N + \Psi u_N $$    (7.2-4)

In order to evaluate $Q_{N+1}$, take the covariance of both sides of Equation (7.0-1a). Since the three terms on
the right-hand side of the equation are independent, the covariance of their sum is the sum of their
covariances.

$$ \mathrm{cov}\{x_{N+1}|Z_N\} = \Phi\, \mathrm{cov}\{x_N|Z_N\}\, \Phi^* + \mathrm{cov}\{\Psi u_N|Z_N\} + F\, \mathrm{cov}\{n_N|Z_N\}\, F^* $$    (7.2-5)

The terms $\mathrm{cov}\{x_{N+1}|Z_N\}$ and $\mathrm{cov}\{x_N|Z_N\}$ are, by definition, $Q_{N+1}$ and $P_N$, respectively. $\Psi u_N$ is deterministic
and, thus, has zero covariance. By the independence of $n_N$ and $Z_N$

$$ \mathrm{cov}\{n_N|Z_N\} = \mathrm{cov}\{n_N\} = I $$    (7.2-6)

Substituting these relationships into Equation (7.2-5) gives

$$ Q_{N+1} = \Phi P_N \Phi^* + F F^* $$    (7.2-7)

Equations (7.2-4) and (7.2-7) constitute the results desired for the prediction step (time update) of the
filtering problem. They readily generalize to predicting more than one sample ahead. These equations justify
our earlier statement that, once the filtering problem is solved, the prediction problem is easy; for suppose
we desire to estimate $x_M$ based on $Z_N$ with M > N. If we can solve the filtering problem to obtain $\hat{x}_N$, the
filtered estimate of $x_N$, then, by a straightforward extension of Equation (7.2-4),

$$ E\{x_M|Z_N\} = \Phi^{M-N} \hat{x}_N + \sum_{i=N}^{M-1} \Phi^{M-i-1} \Psi u_i $$    (7.2-8)

is the desired MAP estimate of $x_M$.
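
As an illustration of the multi-step prediction, the sketch below compares the closed-form sum (the state transition matrix raised to the power M - N applied to the filtered estimate, plus the accumulated input terms) with M - N repeated applications of the one-step prediction of Equation (7.2-4). The system matrices, inputs, and function names are arbitrary assumptions chosen only for the demonstration.

```python
import numpy as np

def predict_one_step(x, Phi, Psi, u):
    """One application of the time update, Equation (7.2-4)."""
    return Phi @ x + Psi @ u

def predict_closed_form(x_hat_N, Phi, Psi, inputs, N, M):
    """Phi^(M-N) x_hat_N + sum over i = N..M-1 of Phi^(M-i-1) Psi u_i."""
    x = np.linalg.matrix_power(Phi, M - N) @ x_hat_N
    for i in range(N, M):
        x = x + np.linalg.matrix_power(Phi, M - i - 1) @ Psi @ inputs[i]
    return x

# Illustrative two-state system with a scalar input (assumed values).
Phi = np.array([[1.0, 0.1], [0.0, 0.9]])
Psi = np.array([[0.0], [0.1]])
inputs = [np.array([np.sin(0.3 * i)]) for i in range(10)]
x_hat_N = np.array([1.0, -0.5])     # stands in for the filtered estimate
N, M = 3, 8

x_iter = x_hat_N.copy()
for i in range(N, M):
    x_iter = predict_one_step(x_iter, Phi, Psi, inputs[i])

x_closed = predict_closed_form(x_hat_N, Phi, Psi, inputs, N, M)
print(np.allclose(x_iter, x_closed))
```

In practice the repeated one-step form is usually preferable, since it avoids forming matrix powers.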

7.2.2  Correction  Step 

For the correction step (measurement update), assume that we know the mean, $\tilde{x}_{N+1}$, and covariance, $Q_{N+1}$, of
the distribution of $x_{N+1}$ given $Z_N$. We seek the distribution of $x_{N+1}$ given both $Z_N$ and $z_{N+1}$. From
Equation (7.0-1b)

$$ z_{N+1} = C x_{N+1} + D u_{N+1} + G \eta_{N+1} $$    (7.2-9)

The distribution of $\eta_{N+1}$ is Gaussian with zero mean and identity covariance. By the same argument as used
for $n_N$, $\eta_{N+1}$ is independent of $Z_N$. Thus, we can say that

$$ p(\eta_{N+1}|Z_N) = p(\eta_{N+1}) $$    (7.2-10)

This trivial-looking statement is the key to the problem, for now everything in the problem is conditioned on
$Z_N$: we know the distributions of $x_{N+1}$ and $\eta_{N+1}$ conditioned on $Z_N$, and we seek the distribution of $x_{N+1}$
conditioned on $Z_N$, and additionally conditioned on $z_{N+1}$.

This problem is thus exactly in the form of Equation (5.1-1), except that all of the distributions
involved are conditioned on $Z_N$. This amounts to nothing more than restating the problem of Chapter 5 on a
different probability space, one conditioned on $Z_N$. The previous results apply directly to the new probability
space. Therefore, from Equations (5.1-14) and (5.1-15)

$$ \hat{x}_{N+1} = \tilde{x}_{N+1} + P_{N+1} C^* (GG^*)^{-1} (z_{N+1} - C\tilde{x}_{N+1} - D u_{N+1}) $$    (7.2-11)

$$ P_{N+1} = (C^* (GG^*)^{-1} C + Q_{N+1}^{-1})^{-1} $$    (7.2-12)

In obtaining Equations (7.2-11) and (7.2-12) from Equations (5.1-14) and (5.1-15), we have identified the
following quantities:

    (5.1-14), (5.1-15)        (7.2-11), (7.2-12)

    $m_\xi$                   $\tilde{x}_{N+1}$
    $P$                       $Q_{N+1}$
    $Z$                       $z_{N+1}$
    $C$                       $C$
    $D$                       $D u_{N+1}$
    $E\{\xi|Z\}$              $\hat{x}_{N+1}$
    $\mathrm{cov}\{\xi|Z\}$   $P_{N+1}$
    $GG^*$                    $GG^*$

This completes the derivation of the correction step (measurement update), which we see to be a direct
application of the results from Chapter 5.

7.2.3  Kalman  Filter 


To complete the recursive solution to the filtering problem, we need only know the solution for some value
of N, and we can now propagate that solution to larger N. The solution for N = 0 is immediate from the
initial problem statement. The distribution of $x_0$, conditioned on $Z_0$ (i.e., conditioned on nothing because
$Z_i = (z_1, \ldots, z_i)^*$), is given to be Gaussian with mean $m_0$ and covariance $P_0$.


Let us now fit together the pieces derived above to show how to solve the filtering problem:

Step 1: Initialization

    Define $\hat{x}_0 = m_0$
    $P_0$ is given

Step 2: Prediction (time update), starting with i = 0,

    $$ \tilde{x}_{i+1} = \Phi \hat{x}_i + \Psi u_i $$    (7.2-13)

    $$ \tilde{z}_{i+1} = C \tilde{x}_{i+1} + D u_{i+1} $$    (7.2-14)

    $$ Q_{i+1} = \Phi P_i \Phi^* + F F^* $$    (7.2-15)

Step 3: Correction (measurement update)

    $$ P_{i+1} = (C^* (GG^*)^{-1} C + Q_{i+1}^{-1})^{-1} $$    (7.2-16)

    $$ \hat{x}_{i+1} = \tilde{x}_{i+1} + P_{i+1} C^* (GG^*)^{-1} (z_{i+1} - \tilde{z}_{i+1}) $$    (7.2-17)

We have defined the quantity $\tilde{z}_{i+1}$ by Equation (7.2-14) in order to make the form of Equation (7.2-17) more
apparent; $\tilde{z}_{i+1}$ can easily be shown to be $E\{z_{i+1}|Z_i\}$. Repeat the prediction and correction steps for
i = 0, 1, ..., N - 1 in order to obtain the MAP estimate of $x_N$ based on $z_1, \ldots, z_N$.
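
The initialization, prediction, and correction steps above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions rather than a production implementation: the matrices are real (so the transpose stands in for the * operation), GG* and each predicted covariance are invertible, and the function name and the test data are our own choices.

```python
import numpy as np

def kalman_filter(Phi, Psi, F, C, D, G, m0, P0, u, z):
    """Discrete-time Kalman filter, Equations (7.2-13) to (7.2-17).

    u[i] is the input at time i (i = 0..N); z[i] is the measurement at
    time i (i = 1..N; z[0] is unused). Returns the filtered means and
    covariances for i = 0..N.
    """
    x_hat, P = m0.copy(), P0.copy()            # initialization
    R_inv = np.linalg.inv(G @ G.T)             # (GG*)^-1, assumed to exist
    means, covs = [x_hat], [P]
    for i in range(len(z) - 1):
        # Prediction (time update)
        x_til = Phi @ x_hat + Psi @ u[i]                        # (7.2-13)
        z_til = C @ x_til + D @ u[i + 1]                        # (7.2-14)
        Q = Phi @ P @ Phi.T + F @ F.T                           # (7.2-15)
        # Correction (measurement update), information form
        P = np.linalg.inv(C.T @ R_inv @ C + np.linalg.inv(Q))   # (7.2-16)
        x_hat = x_til + P @ C.T @ R_inv @ (z[i + 1] - z_til)    # (7.2-17)
        means.append(x_hat)
        covs.append(P)
    return means, covs

# Degenerate check: Phi = 1 and F = 0 make x a constant, so the filter
# reduces to repeated observation of a static random variable.
one, zero = np.array([[1.0]]), np.array([[0.0]])
N = 4
u = [np.array([0.0]) for _ in range(N + 1)]
z = [None] + [np.array([1.0]) for _ in range(N)]
means, covs = kalman_filter(one, zero, zero, one, zero, one,
                            m0=np.array([0.0]), P0=one, u=u, z=z)
print(means[-1], covs[-1])
```

In this static special case each unit-variance measurement adds one unit of information to the unit prior information, so the covariance after N samples is 1/(N + 1), which is a convenient sanity check on the recursion.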

Equations (7.2-13) to (7.2-17) constitute the Kalman filter for discrete-time systems. The recursive form
of this filter is particularly suited to real-time applications. Once $\hat{x}_N$ has been computed, it is not
necessary, as it was using the methods of Section 7.1, to start from scratch in order to compute $\hat{x}_{N+1}$; we need
do only one more prediction step and one more correction step. It is extremely important to note that the
computational cost of obtaining $\hat{x}_{N+1}$ from $\hat{x}_N$ is not a function of N. This means that real-time Kalman
filters can be implemented using fixed finite resources to run for arbitrarily long time intervals. This was
not the case using the methods of Section 7.1, where the estimator started from scratch for each time point,
and each new estimate required more computation than the previous estimate. For some applications, it is also
important that the $P_i$ and $Q_i$ do not depend on the measurements, and can thus be precomputed. Such precomputation
can significantly reduce real-time computational requirements.

None of these advantages should obscure the fact that the Kalman filter obtains the same estimates as were
obtained in Section 7.1. The advantages of the Kalman filter lie in the easier computation of the estimates,
not in improvements in the accuracy of the estimates.
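
The equivalence of the explicit and recursive solutions can be confirmed numerically. The sketch below is an illustration under simplifying assumptions: there are no inputs (the input terms are dropped), the process noise has the same length as the state, and the batch estimate uses the standard Gaussian conditional-mean formula, which we take to correspond to the Chapter 5 result. All names and numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p, l, N = 2, 1, 5                      # state size, measurement size, samples
Phi = np.array([[1.0, 0.1], [-0.2, 0.9]])
F = 0.3 * np.eye(p)
C = np.array([[1.0, 0.0]])
G = np.array([[0.5]])
m0, P0 = np.array([1.0, 0.0]), np.eye(p)
z = [None] + [rng.normal(size=l) for _ in range(N)]   # arbitrary z_1..z_N

# Recursive solution (Section 7.2); cost per step does not grow with N.
x_hat, P = m0, P0
for i in range(N):
    x_til = Phi @ x_hat
    Q = Phi @ P @ Phi.T + F @ F.T
    K = Q @ C.T @ np.linalg.inv(C @ Q @ C.T + G @ G.T)
    x_hat = x_til + K @ (z[i + 1] - C @ x_til)
    P = Q - K @ C @ Q

# Explicit solution (Section 7.1) with xi = [x_0, n_0, ..., n_{N-1}].
n_xi = (N + 1) * p

def state_map(i):
    """Matrix A_i such that x_i = A_i xi, from Equation (7.1-2) with u = 0."""
    A = np.zeros((p, n_xi))
    A[:, :p] = np.linalg.matrix_power(Phi, i)
    for j in range(i):
        A[:, (j + 1) * p:(j + 2) * p] = np.linalg.matrix_power(Phi, i - 1 - j) @ F
    return A

C_big = np.vstack([C @ state_map(i) for i in range(1, N + 1)])
R_big = np.kron(np.eye(N), G @ G.T)
m_xi = np.concatenate([m0, np.zeros(N * p)])
P_xi = np.eye(n_xi)
P_xi[:p, :p] = P0
Z = np.concatenate(z[1:])

# One large solve whose size grows with N.
gain = P_xi @ C_big.T @ np.linalg.inv(C_big @ P_xi @ C_big.T + R_big)
xi_hat = m_xi + gain @ (Z - C_big @ m_xi)
x_batch = state_map(N) @ xi_hat

print(np.allclose(x_hat, x_batch))
```

The batch computation must be repeated from scratch, with larger matrices, for every new time point, whereas the recursion does a fixed amount of work per sample.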

7.2.4  Alternate  Forms 


The filter Equations (7.2-13) to (7.2-17) can be algebraically manipulated into several equivalent alternate
forms. Although all of the variants are formally equivalent, different ones have computational advantages
in different situations. Some of the advantages lie in different points of singularity and different size
matrices to invert. We will show a few of the possible alternate forms in this section.

The first variant comes from using Equations (5.1-12) and (5.1-13) (the covariance form) instead of
(5.1-14) and (5.1-15) (the information form). Equations (7.2-16) and (7.2-17) then become

$$ P_{i+1} = Q_{i+1} - Q_{i+1} C^* (C Q_{i+1} C^* + GG^*)^{-1} C Q_{i+1} $$    (7.2-18)

$$ \hat{x}_{i+1} = \tilde{x}_{i+1} + Q_{i+1} C^* (C Q_{i+1} C^* + GG^*)^{-1} (z_{i+1} - \tilde{z}_{i+1}) $$    (7.2-19)

The covariance form is particularly useful if $GG^*$ or any of the $Q_i$ are singular. The exact conditions
under which $Q_i$ can become singular are fairly complicated, but we can draw some simple conclusions from looking
at Equation (7.2-15). First, if $FF^*$ is nonsingular, then $Q_i$ can never be singular. Second, a singular
$P_0$ (particularly $P_0 = 0$) is likely to cause problems if $FF^*$ is also singular. The only matrix to invert in
Equations (7.2-18) and (7.2-19) is $C Q_{i+1} C^* + GG^*$. If this matrix is singular the problem is ill-posed; the
situation is the same as that discussed in Section 5.3.3.

Note that the covariance form involves inversion of an $\ell$-by-$\ell$ matrix, where $\ell$ is the length of the
observation vector. On the other hand, the information form involves inversion of a p-by-p matrix, where p
is the length of the state vector. For some systems, the difference between $\ell$ and p may be significant,
resulting in a strong preference for one form or the other.
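
A quick numerical check of the equivalence of the two forms is sketched below. The predicted covariance, observation matrix, and noise matrix are random or arbitrary assumptions; only the algebraic identity between the information-form and covariance-form updates is being illustrated.

```python
import numpy as np

rng = np.random.default_rng(1)
p, l = 4, 2                                   # state and observation lengths
A = rng.normal(size=(p, p))
Q = A @ A.T + np.eye(p)                       # a positive definite predicted covariance
C = rng.normal(size=(l, p))
G = np.diag([0.5, 2.0])
R = G @ G.T

# Information form, Equation (7.2-16): inverts p-by-p matrices (and GG*).
P_info = np.linalg.inv(C.T @ np.linalg.inv(R) @ C + np.linalg.inv(Q))

# Covariance form, Equation (7.2-18): inverts a single l-by-l matrix.
S = C @ Q @ C.T + R
P_cov = Q - Q @ C.T @ np.linalg.inv(S) @ C @ Q

# The gains multiplying the measurement residual in the two forms.
K_info = P_info @ C.T @ np.linalg.inv(R)
K_cov = Q @ C.T @ np.linalg.inv(S)

print(np.allclose(P_info, P_cov), np.allclose(K_info, K_cov))
```

Note that the information form cannot even be evaluated when the predicted covariance or GG* is singular, while the covariance form only needs the l-by-l sum to be invertible.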

If G is diagonal (or if $GG^*$ is diagonalizable, in which case the system can be rewritten with a diagonal G),
Equations (7.2-18) and (7.2-19) can be manipulated into a form that involves no matrix inversions. The key to
this manipulation is to consider the system to have $\ell$ independent scalar observations at each time point
instead of a single vector observation of length $\ell$. The scalar observations can then be processed one at a
time. The Kalman filter partitions the estimation problem by processing the measurements one time point at a
time; with this modification, we extend the same partitioning concept to process one element of the measurement
vector at a time. The derivation of the measurement-update Equations (7.2-18) and (7.2-19) applies without
change to a system with several independent observations at a time point. We need only apply the
measurement-update equation $\ell$ times with no intervening time updates. We do need a little more complicated notation to
keep track of the process, but the equations are basically the same.
Let $C^{(j)}$ and $D^{(j)}$ be the jth rows of the C and D matrices, $G^{(j,j)}$ be the jth diagonal element of
G, and $z_{i+1}^{(j)}$ be the jth element of $z_{i+1}$. Define $\hat{x}_{i+1,j}$ to be the estimate of $x_{i+1}$ after the jth
scalar observation at time i + 1 has been processed, and define $P_{i+1,j}$ to be the covariance of $\hat{x}_{i+1,j}$.
We start the measurement update at each time point with

$$ \hat{x}_{i+1,0} = \tilde{x}_{i+1} $$    (7.2-20)

$$ P_{i+1,0} = Q_{i+1} $$    (7.2-21)


Then, for each scalar measurement, we do the update

$$ P_{i+1,j+1} = P_{i+1,j} - P_{i+1,j} C^{(j+1)*} \left( C^{(j+1)} P_{i+1,j} C^{(j+1)*} + (G^{(j+1,j+1)})^2 \right)^{-1} C^{(j+1)} P_{i+1,j} $$    (7.2-22)

$$ \hat{x}_{i+1,j+1} = \hat{x}_{i+1,j} + P_{i+1,j} C^{(j+1)*} \left( C^{(j+1)} P_{i+1,j} C^{(j+1)*} + (G^{(j+1,j+1)})^2 \right)^{-1} \left( z_{i+1}^{(j+1)} - \tilde{z}_{i+1}^{(j+1)} \right) $$    (7.2-23)

where

$$ \tilde{z}_{i+1}^{(j+1)} = C^{(j+1)} \hat{x}_{i+1,j} + D^{(j+1)} u_{i+1} $$    (7.2-24)


Note that the inversions in Equations (7.2-22) and (7.2-23) are scalar inversions rather than matrix inversions.
None of these scalars will be 0 unless $C Q_{i+1} C^* + GG^*$ is singular. After processing all $\ell$ of the scalar
measurements for the time point, we have

$$ \hat{x}_{i+1} = \hat{x}_{i+1,\ell} $$    (7.2-25)

$$ P_{i+1} = P_{i+1,\ell} $$    (7.2-26)
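
The scalar-at-a-time update of Equations (7.2-20) to (7.2-26) can be compared against the vector update of Equations (7.2-18) and (7.2-19). The following sketch assumes real matrices and a diagonal G; the function names and the random test data are illustrative assumptions.

```python
import numpy as np

def vector_update(x_til, Q, C, D, G, u, z):
    """Covariance-form measurement update with one matrix inversion."""
    S = C @ Q @ C.T + G @ G.T
    K = Q @ C.T @ np.linalg.inv(S)
    return x_til + K @ (z - C @ x_til - D @ u), Q - K @ C @ Q

def scalar_updates(x_til, Q, C, D, G, u, z):
    """Process the elements of z one at a time; only scalar divisions."""
    x, P = x_til.copy(), Q.copy()              # (7.2-20), (7.2-21)
    for j in range(len(z)):
        c = C[j]                               # row of C for this measurement
        s = c @ P @ c + G[j, j] ** 2           # the scalar to be inverted
        z_til = c @ x + D[j] @ u               # (7.2-24)
        k = P @ c / s                          # gain from the current P
        x = x + k * (z[j] - z_til)             # (7.2-23)
        P = P - np.outer(k, c @ P)             # (7.2-22)
    return x, P                                # (7.2-25), (7.2-26)

rng = np.random.default_rng(2)
p, l, m = 3, 2, 1                              # state, observation, input lengths
A = rng.normal(size=(p, p))
Q = A @ A.T + np.eye(p)
C = rng.normal(size=(l, p))
D = rng.normal(size=(l, m))
G = np.diag([0.3, 1.5])                        # diagonal, as the method requires
x_til, u, z = rng.normal(size=p), rng.normal(size=m), rng.normal(size=l)

x_vec, P_vec = vector_update(x_til, Q, C, D, G, u, z)
x_seq, P_seq = scalar_updates(x_til, Q, C, D, G, u, z)
print(np.allclose(x_vec, x_seq), np.allclose(P_vec, P_seq))
```

The agreement depends on G being diagonal; with correlated measurement noise the scalar observations are not independent and the sequential form no longer applies without first transforming the measurements.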

7.2.5  Innovations 


A discussion of the Kalman filter would be incomplete without some mention of the innovations. The
innovation at sample point i, also called the residual, is

$$ v_i = z_i - \tilde{z}_i $$    (7.2-27)

where

$$ \tilde{z}_i = E\{z_i|Z_{i-1}\} = C\tilde{x}_i + D u_i $$    (7.2-28)

Following the notation for $Z_i$, we define

$$ V_i = [v_1, v_2, \ldots, v_i]^* $$    (7.2-29)

Now $V_i$ is a linear function of $Z_i$. This is shown by Equations (7.2-13) to (7.2-17) and (7.2-27), which give
formulae for computing the $v_i$ in terms of the $z_i$. It may not be immediately obvious that this function is
invertible. We will prove invertibility by writing the inverse function; i.e., by expressing $Z_i$ in terms of
$V_i$. Repeating Equations (7.2-13) and (7.2-14):

$$ \tilde{x}_{i+1} = \Phi \hat{x}_i + \Psi u_i $$    (7.2-30a)

$$ \tilde{z}_{i+1} = C\tilde{x}_{i+1} + D u_{i+1} $$    (7.2-30b)

Substituting Equation (7.2-27) into Equation (7.2-17) gives

$$ \hat{x}_{i+1} = \tilde{x}_{i+1} + P_{i+1} C^* (GG^*)^{-1} v_{i+1} $$    (7.2-30c)

Finally, from Equation (7.2-27)

$$ z_{i+1} = \tilde{z}_{i+1} + v_{i+1} $$    (7.2-30d)

Equation (7.2-30) is called the innovations form of the system. It gives the recursive formula for computing
the $z_i$ from the $v_i$.
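
The invertibility argument can be exercised directly: run the filter to turn measurements into innovations, then run the innovations form to turn the innovations back into measurements. The sketch below is illustrative only; the system matrices, inputs, data, and function names are assumptions, and list element i of `z` or `v` holds the value for time point i + 1.

```python
import numpy as np

# Illustrative system (assumed values).
Phi = np.array([[1.0, 0.1], [-0.1, 0.95]])
Psi = np.array([[0.0], [0.1]])
F = 0.2 * np.eye(2)
C = np.array([[1.0, 0.0]])
D = np.array([[0.05]])
G = np.array([[0.4]])
m0, P0 = np.zeros(2), np.eye(2)
N = 6
u = [np.array([np.cos(0.5 * i)]) for i in range(N + 1)]     # u_0..u_N

def filter_gains():
    """Gains P_{i+1} C* (GG*)^-1 for i = 0..N-1; they do not depend on z."""
    R_inv = np.linalg.inv(G @ G.T)
    P, gains = P0, []
    for _ in range(N):
        Q = Phi @ P @ Phi.T + F @ F.T
        P = np.linalg.inv(C.T @ R_inv @ C + np.linalg.inv(Q))
        gains.append(P @ C.T @ R_inv)
    return gains

def z_to_v(z, K):
    """Kalman filter: innovations v_1..v_N from measurements z_1..z_N."""
    x_hat, v = m0, []
    for i in range(N):
        x_til = Phi @ x_hat + Psi @ u[i]
        z_til = C @ x_til + D @ u[i + 1]
        v.append(z[i] - z_til)                  # innovation, (7.2-27)
        x_hat = x_til + K[i] @ v[i]
    return v

def v_to_z(v, K):
    """Innovations form: measurements z_1..z_N from innovations v_1..v_N."""
    x_hat, z = m0, []
    for i in range(N):
        x_til = Phi @ x_hat + Psi @ u[i]        # (7.2-30a)
        z_til = C @ x_til + D @ u[i + 1]        # (7.2-30b)
        x_hat = x_til + K[i] @ v[i]             # (7.2-30c)
        z.append(z_til + v[i])                  # (7.2-30d)
    return z

K = filter_gains()
z = [np.array([np.sin(1.3 * i) + 0.1 * i]) for i in range(1, N + 1)]  # arbitrary data
z_back = v_to_z(z_to_v(z, K), K)
print(np.allclose(z, z_back))
```

The round trip works for any measurement sequence, not just one generated by the model, because the two recursions differ only in which of the two quantities is treated as the input.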


Let us examine the distribution of the innovations. The innovations are obviously Gaussian, because they
are linear functions of Z, which is Gaussian. Using Equation (3.3-10), it is immediate that the mean of the
innovation is 0.

$$ E\{v_i\} = E\{z_i - E(z_i|Z_{i-1})\} = E\{z_i\} - E\{E(z_i|Z_{i-1})\} = 0 $$    (7.2-31)

Derive the covariance matrix of the innovation by writing

$$ v_i = C x_i + D u_i + G \eta_i - C\tilde{x}_i - D u_i = C(x_i - \tilde{x}_i) + G \eta_i $$    (7.2-32)
The two terms on the right are independent, so

$$ \mathrm{cov}(v_i) = C\, \mathrm{cov}(x_i - \tilde{x}_i)\, C^* + GG^* = C Q_i C^* + GG^* $$    (7.2-33)

The most interesting property of the innovations is that $v_i$ is independent of $v_j$ for $i \neq j$. To
prove this, it is sufficient to show that $v_i$ is independent of $V_{i-1}$. Let us examine $E\{v_i|V_{i-1}\}$. Since
$V_{i-1}$ is obtained from $Z_{i-1}$ by an invertible continuous transformation, conditioning on $V_{i-1}$ is the same
as conditioning on $Z_{i-1}$. (If one is known, so is the other.) Therefore,

$$ E\{v_i|V_{i-1}\} = E\{v_i|Z_{i-1}\} = 0 $$    (7.2-34)

as shown in Equation (7.2-31). Thus we have

$$ E\{v_i|V_{i-1}\} = E\{v_i\} $$    (7.2-35)

Comparing this equation with the formula for the Gaussian conditional mean given in Theorem (3.5-9), we see
that this can be true only if $v_i$ and $V_{i-1}$ are uncorrelated (the cross-covariance term is 0 in the theorem). Then by
Theorem (3.5-8), $v_i$ and $V_{i-1}$ are independent.

The innovation is thus a discrete-time white-noise process (i.e., each time point is independent of all
of the others). Thus, the Kalman filter is often called a whitening filter; it creates a white process (V)
as a function of a nonwhite process (Z).


7.3  STEADY-STATE  FORM 

The largest computational cost of the Kalman filter is in the computation of the covariance matrix $P_i$
using Equations (7.2-15) and (7.2-16) (or any of the alternate forms). For a large and important class of
problems, we can replace $P_i$ and $Q_i$ by constants P and Q, independent of time. This approach significantly
lowers the computational cost of the filter.

We will restrict the discussion in this section to time-invariant systems; in only a few special cases
do time-invariant filters make sense for time-varying systems.

Equations  that  a  time  Invariant  filter  must  satisfy  are  easily  derived.  Using  Equations  (7.2-18) 
and  (7.2-15),  we  can  express  Q^+1  as  a  function  of  Qj. 

$$ Q_{i+1} = \Phi[Q_i - Q_i C^*(C Q_i C^* + GG^*)^{-1} C Q_i]\Phi^* + FF^* \tag{7.3-1} $$

Thus, for $Q_i$ to equal a constant $Q$, we must have

$$ Q = \Phi[Q - Q C^*(C Q C^* + GG^*)^{-1} C Q]\Phi^* + FF^* \tag{7.3-2} $$

This is the algebraic matrix Riccati equation for discrete-time systems. (An alternate form can be obtained
by using Equation (7.2-16) in place of Equation (7.2-18); the condition can also be written in terms of $P$
instead of $Q$.)
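One simple way to obtain a solution of Equation (7.3-2), when the recursion converges, is to iterate Equation (7.3-1) until the covariance stops changing. The sketch below assumes real matrices (so that the transpose plays the role of $*$); the function name, tolerance, and example system are illustrative assumptions. It is not one of the eigenvector decomposition methods and is usually slower than they are.

```python
import numpy as np

def riccati_iterate(phi, c, f, g, q0, tol=1e-12, max_iter=100000):
    """Iterate Equation (7.3-1) from q0 until successive iterates agree to within tol."""
    ff = f @ f.T
    gg = g @ g.T
    q = q0.copy()
    for _ in range(max_iter):
        s = c @ q @ c.T + gg                               # C Q C* + G G*
        p = q - q @ c.T @ np.linalg.solve(s, c @ q)        # bracketed term of (7.3-1)
        q_next = phi @ p @ phi.T + ff
        if np.max(np.abs(q_next - q)) < tol:
            return q_next
        q = q_next
    raise RuntimeError("Riccati iteration did not converge")

# Illustrative marginally stable system (double integrator sampled at 0.1)
phi = np.array([[1.0, 0.1], [0.0, 1.0]])
c = np.array([[1.0, 0.0]])
f = np.array([[0.0], [0.1]])
g = np.array([[0.5]])

q_ss = riccati_iterate(phi, c, f, g, q0=np.eye(2))

# Residual of Equation (7.3-2) at the converged value
s = c @ q_ss @ c.T + g @ g.T
rhs = phi @ (q_ss - q_ss @ c.T @ np.linalg.solve(s, c @ q_ss)) @ phi.T + f @ f.T
print("max residual of (7.3-2):", np.max(np.abs(rhs - q_ss)))
```

The example system is not stable, but its modes are observable and are excited by the process noise, so the iteration is expected to settle on a single value regardless of the starting covariance.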

If $Q$ is a scalar, the algebraic Riccati equation is a quadratic equation in $Q$ and the solution is
simple. For nonscalar $Q$, the solution is far more difficult and has been the subject of numerous papers.

We will not cover the details of deriving and implementing numerical methods for solving the Riccati equation.
The most widely used methods are based on eigenvector decomposition (Potter, 1966; Vaughan, 1970; and Geyser
and Lehtinen, 1975). When a unique solution exists, these methods give accurate results with small
computational costs.

The derivation of the conditions under which Equation (7.3-2) has an acceptable solution is more
complicated than would be appropriate for inclusion in this text. We therefore present the following result
without proof:


Theorem 7.3-1  If all unstable or marginally stable modes of the system are
controllable by the process noise and are observable, and if $CFF^*C^* + GG^*$
is invertible, then Equation (7.3-2) has a unique positive semidefinite
solution and $Q_i$ converges to this solution for all choices of the initial
covariance, $P_0$.

Proof  See Schweppe (1973, p. 142) for a heuristic argument, or Balakrishnan
(1961) and Kailath and Ljung (1976) for more rigorous treatments.

The condition on $CFF^*C^* + GG^*$ ensures that the problem is well-posed. Without this condition, the inverse
in Equation (7.3-1) may not exist for some initial $P_0$ (particularly $P_0 = 0$). Some statements of the theorem
incorporate the stronger requirement that $GG^*$ be invertible, but the weaker condition is sufficient. Perhaps
the most important point to note is that the system is not required to be stable. Although the existence and
uniqueness of the solution are easier to prove for stable systems, the more general conditions of
Theorem (7.3-1) are important in the estimation and control of unstable systems.

We can achieve a heuristic understanding of the need for the conditions of Theorem (7.3-1) by examining
one-dimensional systems, for which we can write the solutions to Equation (7.3-2) explicitly. If the system
is one-dimensional, then it is observable if $C$ is nonzero (and $G$ is finite), and it is controllable by the
process noise if $F$ is nonzero. We will consider the problem in several cases.

Case 1: $G = 0$. In this case, we must have $C \neq 0$ and $F \neq 0$ in order for the problem to be well-posed.
Equation (7.3-1) then reduces to $Q_{i+1} = FF^*$, giving a unique time-invariant covariance satisfying
Equation (7.3-2).

Case 2: $G \neq 0$, $C = 0$, $F = 0$. In this case, Equation (7.3-1) becomes $Q_{i+1} = \Phi^2 Q_i$. This converges
to $Q = 0$ if $|\Phi| < 1$ (stable system). If $|\Phi| = 1$, $Q_i$ remains at the starting value, and thus the
steady-state covariance is not unique. If $|\Phi| > 1$, the solution diverges or stays at 0, depending on the
starting value.

Case 3: $G \neq 0$, $C = 0$, $F \neq 0$. In this case, Equation (7.3-2) reduces to

$$ Q = \Phi^2 Q + F^2 \tag{7.3-3} $$

For $|\Phi| < 1$, this equation has a unique, nonnegative solution

$$ Q = \frac{F^2}{1 - \Phi^2} \tag{7.3-4} $$

and convergence of Equation (7.3-1) to this solution is easily shown. If $|\Phi| \geq 1$, the solution is negative,
which is not an admissible covariance, or infinite; in either event, Equation (7.3-1) diverges to infinity.

Case 4: $G \neq 0$, $C \neq 0$, $F = 0$. In this case, Equation (7.3-2) is a quadratic equation with roots zero and
$(\Phi^2 - 1)G^2/C^2$. If $|\Phi| < 1$, the second root is negative, and thus there is a unique nonnegative root. If
$|\Phi| = 1$, there is a double root at zero, and the solution is still unique. In both of these events,
convergence of Equation (7.3-1) to the solution at 0 is easy to show. If $|\Phi| > 1$, there are two nonnegative
roots, and the system can converge to either one, depending on whether or not the initial covariance is zero.


Case 5: $G \neq 0$, $C \neq 0$, $F \neq 0$. In this case, Equation (7.3-2) is a quadratic equation with roots

$$ Q = \tfrac{1}{2}H \pm \sqrt{\tfrac{1}{4}H^2 + F^2G^2/C^2} \tag{7.3-5} $$

where

$$ H = F^2 + (\Phi^2 - 1)G^2/C^2 \tag{7.3-6} $$

Regardless of the value of $\Phi$, the square-root term is always larger in magnitude than $(1/2)H$; therefore, there
is one positive and one negative root. Convergence of Equation (7.3-1) to the positive root is easy to show.
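The scalar result of Case 5 is easy to check numerically. The following sketch, with illustrative values for an unstable system ($|\Phi| > 1$), evaluates the two roots of the scalar form of Equation (7.3-2), which is $Q^2 - HQ - F^2G^2/C^2 = 0$ with $H$ as in Equation (7.3-6), and then iterates the scalar form of Equation (7.3-1) from an arbitrary starting covariance. The function names and numbers are assumptions made for the example.

```python
import math

def scalar_riccati_roots(phi, c, f, g):
    """Roots of Q^2 - H Q - F^2 G^2 / C^2 = 0, with H from Equation (7.3-6)."""
    h = f ** 2 + (phi ** 2 - 1.0) * g ** 2 / c ** 2
    root = math.sqrt(0.25 * h ** 2 + f ** 2 * g ** 2 / c ** 2)
    return 0.5 * h + root, 0.5 * h - root

def scalar_riccati_step(q, phi, c, f, g):
    """One application of the scalar form of Equation (7.3-1)."""
    return phi ** 2 * (q - q ** 2 * c ** 2 / (c ** 2 * q + g ** 2)) + f ** 2

phi, c, f, g = 1.5, 1.0, 0.5, 2.0      # illustrative unstable system
q_pos, q_neg = scalar_riccati_roots(phi, c, f, g)

q = 10.0                                # arbitrary starting covariance
for _ in range(200):
    q = scalar_riccati_step(q, phi, c, f, g)

print("positive root:", q_pos)
print("negative root:", q_neg)
print("iterated value:", q)
```

With both $C$ and $F$ nonzero, the iteration should settle on the positive root even though the system itself is unstable.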

Let us now summarize the results of these five cases. In all well-posed cases, the covariance converges
to a unique value if the system is stable. For unstable or marginally stable systems, a unique converged value
is assured if both $C$ and $F$ are nonzero. For one-dimensional systems, there is also a unique convergent
solution for $|\Phi| = 1$, $G \neq 0$, $C \neq 0$, $F = 0$; this case illustrates that the conditions of
Theorem (7.3-1) are not necessary, although they are sufficient.

Heuristically, we can say that observability ($C \neq 0$) prevents the covariance from diverging to infinity
for unstable systems. Controllability by the process noise ($F \neq 0$) ensures uniqueness by eliminating the
possibility of perfect prediction ($Q = 0$).


An important related question to consider is the stability of the filter. We define the corrected error
vector to be

$$ e_i = x_i - \hat{x}_i \tag{7.3-7} $$

Using Equations (7.0-1), (7.2-15), (7.2-16), and (7.2-19) gives the recursive relationship

$$ e_{i+1} = (I - KC)\Phi e_i + (I - KC)F n_i - KG\eta_{i+1} \tag{7.3-8} $$

where

$$ K = PC^*(GG^*)^{-1} = QC^*(CQC^* + GG^*)^{-1} \tag{7.3-9} $$

We can show that, given the conditions of Theorem (7.3-1), the system of Equation (7.3-8) is stable. This
stability implies that, in the absence of new disturbances (noise), errors in the state estimate will die out
with time; furthermore, for bounded disturbances, the errors will always be bounded. A rigorous proof is not
presented here.

It is interesting to examine the stability of the one-dimensional example with $G \neq 0$, $C \neq 0$, $F = 0$, and
$|\Phi| = 1$. We previously noted that $Q_i$ for this case converges to 0 for all initial covariances. Let us
examine the steady-state filter. For this case, Equation (7.3-8) reduces to

$$ e_{i+1} = e_i \tag{7.3-10} $$

which is only marginally stable. Recall that this case did not meet the conditions of Theorem (7.3-1), so our
stability guarantee does not apply. Although a steady-state filter exists, it does not perform at all like the
time-varying filter. The time-varying filter reduces the error to zero asymptotically with time. The
steady-state filter has no feedback, and the error remains at its initial value. Balakrishnan (1984) discusses
the steady-state filter in more detail.
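The stability of the error recursion of Equation (7.3-8) can be examined numerically through the eigenvalues of $(I - KC)\Phi$. The sketch below is illustrative only: it assumes real matrices, approximates the steady-state covariance by a fixed number of iterations of Equation (7.3-1), and uses made-up example systems and function names. The first system is unstable but meets the conditions of Theorem (7.3-1); the second is the marginal scalar case with $F = 0$ and $\Phi = 1$.

```python
import numpy as np

def steady_state_gain(phi, c, f, g, n_iter=2000):
    """Approximate steady-state gain K of Equation (7.3-9) by iterating Equation (7.3-1)."""
    q = np.eye(phi.shape[0])
    for _ in range(n_iter):
        s = c @ q @ c.T + g @ g.T
        p = q - q @ c.T @ np.linalg.solve(s, c @ q)
        q = phi @ p @ phi.T + f @ f.T
    s = c @ q @ c.T + g @ g.T
    return q @ c.T @ np.linalg.inv(s)

def error_spectral_radius(phi, c, k):
    """Largest eigenvalue magnitude of (I - K C) Phi, the error transition of (7.3-8)."""
    n = phi.shape[0]
    return float(max(abs(np.linalg.eigvals((np.eye(n) - k @ c) @ phi))))

# Unstable system; all modes observable and excited by the process noise
phi_u = np.array([[1.1, 0.1], [0.0, 0.9]])
c_u = np.array([[1.0, 0.0]])
f_u = np.array([[0.2], [0.1]])
g_u = np.array([[1.0]])
rho_unstable = error_spectral_radius(phi_u, c_u, steady_state_gain(phi_u, c_u, f_u, g_u))

# Marginal scalar case: Phi = 1, F = 0, so the gain shrinks toward zero
one = np.array([[1.0]])
zero = np.array([[0.0]])
rho_marginal = error_spectral_radius(one, one, steady_state_gain(one, one, zero, one))

print("system spectral radius:       ", float(max(abs(np.linalg.eigvals(phi_u)))))
print("error radius, unstable system:", rho_unstable)
print("error radius, marginal case:  ", rho_marginal)
```

In the first case the error dynamics should be stable even though the system is not. In the second case the gain after a finite number of iterations is small but nonzero, so the radius is slightly below one and approaches one as the number of iterations grows.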


Two special cases of time-invariant Kalman filters deserve special note. The first case is where $F$ is
zero and the system is stable (and $GG^*$ must be invertible to ensure a well-posed problem). In this case, the
steady-state Kalman gain $K$ is zero. The Kalman filter simply integrates the state equation, ignoring any
available measurements. Since the system is stable and has no disturbances, the error will decay to zero.

The same filter is obtained for nonzero $F$ if $C$ is zero or if $G$ is infinite. The error does not then
decay to zero, but the output contains no useful information to feed back.

The second special case is where $G$ is zero and $C$ is square and invertible. $FF^*$ must be invertible
to ensure a well-posed problem. For this case, the Kalman gain is $C^{-1}$. The estimator then reduces to

$$ \hat{x}_i = C^{-1}(z_i - D u_i) \tag{7.3-11} $$

which ignores all previous information. The current state can be reconstructed exactly from the current
measurement, so there is no need to consider past data. This is the antithesis of the case where $F$ is 0 and no
information from the current measurement is used. Most realistic systems lie somewhere between these two
extremes.


7.4  CONTINUOUS  TIME 

The form of a linear continuous-time system model is

$$ \dot{x}(t) = Ax(t) + Bu(t) + F_c n(t) \tag{7.4-1a} $$

$$ z(t) = Cx(t) + Du(t) + G_c \eta(t) \tag{7.4-1b} $$

where $n$ and $\eta$ are assumed to be zero-mean white-noise processes with unity power spectral density. The
input $u$ is assumed to be known exactly. As in the discrete-time analysis, we will simplify the notation by
assuming that the system is time invariant. The same derivation applies to time-varying systems by evaluating
the matrices at the appropriate time points.

We will analyze Equation (7.4-1) as a limit of the discrete-time systems

$$ x(t_i + \Delta) = (I + \Delta A)x(t_i) + \Delta Bu(t_i) + \Delta^{1/2}F_c n(t_i) \tag{7.4-2a} $$

$$ z(t_i) = Cx(t_i) + Du(t_i) + \Delta^{-1/2}G_c \eta(t_i) \tag{7.4-2b} $$

where $n$ and $\eta$ are discrete-time white-noise processes with identity covariances. The reasons for the
$\Delta^{1/2}$ factors were discussed in Section 6.2.

The filter for the system of Equation (7.4-2) is obtained by making appropriate substitutions in
Equations (7.2-13) to (7.2-17). We need to substitute $(I + \Delta A)$ in place of $\Phi$, $\Delta B$ in place of
$\Psi$, $\Delta F_cF_c^*$ in place of $FF^*$, and $\Delta^{-1}G_cG_c^*$ in place of $GG^*$. Combining
Equations (7.2-13), (7.2-14), and (7.2-17) and making the substitutions gives

$$ \hat{x}(t_i + \Delta) = (I + \Delta A)\hat{x}(t_i) + \Delta Bu(t_i) + \Delta P(t_i + \Delta)C^*(G_cG_c^*)^{-1}[z(t_i + \Delta) - C(I + \Delta A)\hat{x}(t_i) - C\Delta Bu(t_i) - Du(t_i + \Delta)] \tag{7.4-3} $$

Subtracting $\hat{x}(t_i)$ and dividing by $\Delta$ gives

$$ \frac{\hat{x}(t_i + \Delta) - \hat{x}(t_i)}{\Delta} = A\hat{x}(t_i) + Bu(t_i) + P(t_i + \Delta)C^*(G_cG_c^*)^{-1}[z(t_i + \Delta) - C(I + \Delta A)\hat{x}(t_i) - C\Delta Bu(t_i) - Du(t_i + \Delta)] \tag{7.4-4} $$

Taking the limit as $\Delta \to 0$ gives the filter equation

$$ \dot{\hat{x}}(t) = A\hat{x}(t) + Bu(t) + P(t)C^*(G_cG_c^*)^{-1}[z(t) - C\hat{x}(t) - Du(t)] \tag{7.4-5} $$

It remains to find the equation for $P(t)$. First note that Equation (7.2-15) becomes

$$ Q(t_i + \Delta) = (I + \Delta A)P(t_i)(I + \Delta A)^* + \Delta F_cF_c^* \tag{7.4-6} $$

and thus

$$ \lim_{\Delta \to 0} Q(t_i + \Delta) = P(t_i) \tag{7.4-7} $$

Equation (7.2-18) is a more convenient form for our current purposes than (7.2-16). Make the appropriate
substitutions in Equation (7.2-18) to get

$$ P(t_i + \Delta) = Q(t_i + \Delta) - Q(t_i + \Delta)C^*(CQ(t_i + \Delta)C^* + \Delta^{-1}G_cG_c^*)^{-1}CQ(t_i + \Delta) \tag{7.4-8} $$

Subtract $P(t_i)$ and divide by $\Delta$ to give

$$ \frac{P(t_i + \Delta) - P(t_i)}{\Delta} = \frac{Q(t_i + \Delta) - P(t_i)}{\Delta} - Q(t_i + \Delta)C^*(\Delta CQ(t_i + \Delta)C^* + G_cG_c^*)^{-1}CQ(t_i + \Delta) \tag{7.4-9} $$

For the first term on the right of Equation (7.4-9), substitute from Equation (7.4-6) to get

$$ \frac{Q(t_i + \Delta) - P(t_i)}{\Delta} = AP(t_i) + P(t_i)A^* + \Delta AP(t_i)A^* + F_cF_c^* \tag{7.4-10} $$

Thus in the limit Equation (7.4-9) becomes

$$ \dot{P}(t) = AP(t) + P(t)A^* + F_cF_c^* - P(t)C^*(G_cG_c^*)^{-1}CP(t) \tag{7.4-11} $$

Equation (7.4-11) is the continuous-time Riccati equation. The initial condition for the equation is $P_0$,
the covariance of the initial state. $P_0$ is assumed to be known. Equations (7.4-5) and (7.4-11) constitute
the solution to the continuous-time filtering problem for linear systems with white process and measurement
noise. The continuous-time filter requires $G_cG_c^*$ to be nonsingular.
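As an illustration of Equation (7.4-11), the sketch below integrates the Riccati equation with a simple forward-Euler step and compares the result with the closed-form steady value for a scalar system. For a scalar system, setting the derivative to zero gives a quadratic whose nonnegative root is $P = (G_c^2/C^2)[A + (A^2 + F_c^2C^2/G_c^2)^{1/2}]$. The function name, step size, and numerical values are assumptions made for the example; Euler integration is used only for brevity.

```python
import numpy as np

def integrate_riccati(a, c, fc, gc, p0, dt=1e-3, t_final=20.0):
    """Forward-Euler integration of the continuous-time Riccati equation (7.4-11)."""
    r_inv = np.linalg.inv(gc @ gc.T)          # requires Gc Gc* to be nonsingular
    p = p0.copy()
    for _ in range(int(round(t_final / dt))):
        p_dot = a @ p + p @ a.T + fc @ fc.T - p @ c.T @ r_inv @ c @ p
        p = p + dt * p_dot
    return p

# Illustrative scalar system with an unstable pole at 0.5
a = np.array([[0.5]])
c = np.array([[1.0]])
fc = np.array([[1.0]])
gc = np.array([[1.0]])

p_final = integrate_riccati(a, c, fc, gc, p0=np.zeros((1, 1)))

# Nonnegative root of 2 A P + Fc^2 - P^2 C^2 / Gc^2 = 0
a0, c0, f0, g0 = a[0, 0], c[0, 0], fc[0, 0], gc[0, 0]
p_steady = (g0 ** 2 / c0 ** 2) * (a0 + np.sqrt(a0 ** 2 + f0 ** 2 * c0 ** 2 / g0 ** 2))

print("integrated P at t_final:", p_final[0, 0])
print("closed-form steady P:   ", p_steady)
```

Starting from a different nonnegative initial covariance should lead to the same final value for this system.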

One point worth noting about the continuous-time filter is that the innovation $z(t) - \hat{z}(t)$ is a
white-noise process with the same power spectral density as the measurement noise. (They are not, however, the
same process.) The power spectrum of the innovation can be found by looking at the limit of Equation (7.2-33).
Making the appropriate substitutions gives

$$ \mathrm{cov}(\nu(t_i)) = CQ(t_i)C^* + \Delta^{-1}G_cG_c^* \tag{7.4-12} $$

The power spectral density of the innovation is then

$$ \lim_{\Delta \to 0} \Delta\,\mathrm{cov}(\nu(t_i)) = G_cG_c^* \tag{7.4-13} $$

The disappearance of the first term of Equation (7.4-12) in the limit makes the continuous-time filter simpler
than the discrete-time one in many ways.

For time-invariant continuous-time systems, we can investigate the possibility that the filter reaches a
steady state. As in the discrete-time steady-state filter, this outcome would result in a significant
computational advantage. If the steady-state filter exists, it is obvious that the steady-state $P(t)$ must
satisfy the equation

$$ AP + PA^* + F_cF_c^* - PC^*(G_cG_c^*)^{-1}CP = 0 \tag{7.4-14} $$

obtained by setting $\dot{P}$ to 0 in Equation (7.4-11). The eigenvector decomposition methods referenced after
Equation (7.3-2) are also the best practical numerical methods for solving Equation (7.4-14). The following
theorem, comparable to Theorem (7.3-1), is not proven here.

Theorem 7.4-1  If all unstable or neutrally stable modes of the system are
controllable by the process noise and are observable, and if $G_cG_c^*$ is
invertible, then Equation (7.4-14) has a unique positive semidefinite
solution, and $P(t)$ converges to this solution for all choices of the initial
covariance $P_0$.

Proof  See Kailath and Ljung (1976), Balakrishnan (1981), or Kalman and
Bucy (1961).


7.5  CONTINUOUS/DISCRETE  TIME 

Many practical applications of filtering involve discrete sampled measurements of systems with
continuous-time dynamics. Since this problem has elements of both discrete and continuous time, there is often
debate over whether the discrete- or continuous-time filter is more appropriate. In fact, neither of these
filters is appropriate because they are both based on models that are not realistic representations of the true
system. As Schweppe (1973, p. 206) says,

    Some rather interesting arguments sometimes result when one asks the question,
    Are the discrete- or the continuous-time results more useful? The answer is,
    of course, that the question is stupid.... neither is superior in all cases.

The appropriate model for a continuous-time dynamic system with discrete-time measurements is a continuous-time
model with discrete-time measurements. Although this statement sounds like a tautology, its point has been
missed enough to make it worth emphasizing. Some of the confusion may be due to the mistaken impression that
such a mixed model could not be analyzed with the available tools. In fact, the derivation of the appropriate
filter is trivial, given the pure continuous- and pure discrete-time results. The filter for this class of
problems simply involves an appropriate combination of the discrete- and continuous-time filters previously
derived. It takes only a few lines to show how the previously derived results fit this problem. We will spend
most of this section talking about implementation issues in a little more detail.

Let the system be described by

$$ \dot{x}(t) = Ax(t) + Bu(t) + F_c n(t) \tag{7.5-1a} $$

$$ z(t_i) = Cx(t_i) + Du(t_i) + G\eta_i \qquad i = 1, 2, \ldots \tag{7.5-1b} $$

Equation (7.5-1a) is identical to Equation (7.4-1a); and, except for a notation change, Equation (7.5-1b) is
identical to Equation (7.0-1b). Note that the observation is only defined at the discrete points $t_i$,
although the state is defined in continuous time.


Between the times of two observations, the analysis of Equation (7.5-1) is identical to that of
Equation (7.4-1) with an infinite $G$ matrix or a zero $C$ matrix; either of these conditions is equivalent to
having no useful observation. Let $\hat{x}(t_i)$ be the state estimate at time $t_i$ based on the observations up
to and including $z(t_i)$. Then the predicted estimate in the interval $(t_i, t_i + \Delta]$ is obtained from

$$ \tilde{x}(t_i^+) = \hat{x}(t_i) \tag{7.5-2} $$

$$ \dot{\tilde{x}}(t) = A\tilde{x}(t) + Bu(t) \tag{7.5-3} $$

The covariance of the prediction is

$$ Q(t_i^+) = P(t_i) \tag{7.5-4} $$

$$ \dot{Q}(t) = AQ(t) + Q(t)A^* + F_cF_c^* \tag{7.5-5} $$

Equations (7.5-3) and (7.5-5) are obtained directly by substituting $C = 0$ in Equations (7.4-5) and (7.4-11).
The notation has been changed to indicate that, because there is no observation in the interval, these are
predicted estimates; whereas, in the pure continuous-time filter, the observations are continuously used and
filtered estimates are obtained. Integrate Equations (7.5-3) and (7.5-5) over the interval $(t_i, t_i + \Delta)$
to obtain the predicted estimate $\tilde{x}(t_i + \Delta)$ and its covariance $Q(t_i + \Delta)$.

In practice, although $u(t)$ is defined continuously, it will often be measured (or otherwise known) only
at the time points $t_i$. Furthermore, the integration will likely be done by a digital computer, which cannot
integrate continuous-time data exactly. Thus Equation (7.5-3) will be integrated numerically. The simplest
integration approximation would give

$$ \tilde{x}(t_i + \Delta) \approx (I + \Delta A)\tilde{x}(t_i^+) + \Delta Bu(t_i) \tag{7.5-6} $$

This approximation may be adequate for some purposes, but it is more often a little too crude. If the
$A$ matrix is time-varying, there are several reasonable integration schemes which we will not discuss here;
the most common are based on Runge-Kutta algorithms (Acton, 1970). For systems with time-invariant $A$
matrices and constant sample intervals, the transition matrix is by far the most efficient approach. First
define

$$ \Phi = \exp(A\Delta) \tag{7.5-7} $$

$$ \Psi = \int_0^\Delta \exp(A\tau)\,d\tau\; B \tag{7.5-8} $$

Then

$$ \tilde{x}(t_i + \Delta) \approx \Phi\tilde{x}(t_i^+) + \Psi u(t_i) \tag{7.5-9} $$

This approximation is the exact solution to Equation (7.5-3) if $u(t)$ holds its value between samples.
Wiberg (1971) and Zadeh and Desoer (1963) derive this solution. Moler and Van Loan (1978) discuss various
means of numerically evaluating Equations (7.5-7) and (7.5-8). Equation (7.5-9) has the advantage of being
in the exact form in which discrete-time systems are usually written (Equation (7.0-1a)).
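One way to evaluate Equations (7.5-7) and (7.5-8) together is to exponentiate an augmented matrix: the exponential of the block matrix with $A\Delta$ and $B\Delta$ in its top row and zeros below has $\Phi$ in its upper-left block and $\Psi$ in its upper-right block. The sketch below uses a truncated Taylor series for the exponential, which is an assumption made to keep the example self-contained; it is adequate only when the norm of $A\Delta$ is modest and is not necessarily one of the preferred methods. The function names are illustrative.

```python
import numpy as np

def expm_taylor(m, terms=30):
    """Truncated Taylor series for exp(m); reasonable only for modest norm(m)."""
    result = np.eye(m.shape[0])
    term = np.eye(m.shape[0])
    for k in range(1, terms):
        term = term @ m / k          # m^k / k!
        result = result + term
    return result

def discretize(a, b, delta):
    """Return (Phi, Psi) of Equations (7.5-7) and (7.5-8) from one augmented exponential."""
    n, m = b.shape
    aug = np.zeros((n + m, n + m))
    aug[:n, :n] = a * delta
    aug[:n, n:] = b * delta
    e = expm_taylor(aug)
    return e[:n, :n], e[:n, n:]

# Scalar check: A = -2, B = 1, sample interval 0.1
phi_s, psi_s = discretize(np.array([[-2.0]]), np.array([[1.0]]), 0.1)
print(phi_s[0, 0], np.exp(-0.2))
print(psi_s[0, 0], (1.0 - np.exp(-0.2)) / 2.0)

# Double integrator: Phi = [[1, delta], [0, 1]], Psi = [[delta^2 / 2], [delta]]
phi_d, psi_d = discretize(np.array([[0.0, 1.0], [0.0, 0.0]]), np.array([[0.0], [1.0]]), 0.1)
print(phi_d)
print(psi_d)
```

Both matrices need to be computed only once when $A$, $B$, and the sample interval are constant.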

Equation (7.5-9) introduces about 1/2-sample delay in the modeling of the response to the control input
unless the continuous-time $u(t)$ holds its value between samples; this delay is often unacceptable.
Figure (7.5-1) shows a sample input signal and the signal as modeled by Equation (7.5-9). A better
approximation is usually

$$ \tilde{x}(t_i + \Delta) \approx \Phi\tilde{x}(t_i^+) + (1/2)\Psi(u(t_i) + u(t_i + \Delta)) \tag{7.5-10} $$

This equation models $u(t)$ between samples as being constant at the average of the two sample values.
Figure (7.5-2) illustrates this model. There is little phase lag in the model represented by Equation (7.5-10),
and the difference in implementation cost between Equations (7.5-9) and (7.5-10) is negligible.
Equation (7.5-10) is probably the most commonly used approximation method with time-invariant $A$ matrices.
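The difference between the two input models can be seen on a case with a known exact answer. The sketch below assumes the scalar system $\dot{x} = -x + u$ with a ramp input $u(t) = t$, for which the exact response over one sample interval is $e^{-\Delta}x_0 + \Delta - 1 + e^{-\Delta}$. The system, the input, and the numbers are illustrative assumptions, not taken from the text.

```python
import math

a, delta = -1.0, 0.5
phi = math.exp(a * delta)        # Equation (7.5-7) for a scalar A
psi = (phi - 1.0) / a            # Equation (7.5-8) with B = 1

def u(t):
    """Ramp input used for the comparison."""
    return t

x0 = 0.0
x_hold = phi * x0 + psi * u(0.0)                        # Equation (7.5-9): input held at u(t_i)
x_avg = phi * x0 + 0.5 * psi * (u(0.0) + u(delta))      # Equation (7.5-10): averaged input
x_exact = phi * x0 + (delta - 1.0 + math.exp(-delta))   # closed form for the ramp, A = -1

print("held input:    ", x_hold, "error", abs(x_hold - x_exact))
print("averaged input:", x_avg, "error", abs(x_avg - x_exact))
```

For this input the held-value model ignores the growth of $u$ over the interval, while the averaged model captures most of it at essentially the same cost.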

The high-frequency content introduced by the jumps in the above models can be removed by modeling $u(t)$ as
a linear interpolation between the measured values as illustrated in Figure (7.5-3). This model adds another
term to Equation (7.5-10) proportional to $u(t_i + \Delta) - u(t_i)$. In our experience, this degree of fidelity is
usually unnecessary, and is not worth the extra cost and complication. There are some applications where the
accuracy required might justify this or even more complicated methods, such as higher-order spline fits. (The
linear interpolation is a first-order spline.)

If you are using a Runge-Kutta algorithm instead of a transition-matrix algorithm for solving the
differential equation, linear interpolation of the input introduces negligible extra cost and is common practice.

Equation (7.5-5) does not involve measured data and thus does not present the problems of interpolating
between the measurements. The exact solution of Equation (7.5-5) is


$$ Q(t_i + \Delta) = \Phi Q(t_i^+)\Phi^* + \int_0^\Delta \exp(A(\Delta - \tau))F_cF_c^* \exp(A^*(\Delta - \tau))\,d\tau \tag{7.5-11} $$

as can be verified by substitution. Note that Equation (7.5-11) is exactly in the form of a discrete-time
update of the covariance (Equation (7.2-15)) if $F$ is defined as a square root of the integral term. For
small $\Delta$, the integral term is well approximated by $\Delta F_cF_c^*$, resulting in

$$ Q(t_i + \Delta) \approx \Phi Q(t_i^+)\Phi^* + \Delta F_cF_c^* \tag{7.5-12} $$

The errors in this approximation are usually far smaller than the uncertainty in the value of $F_c$, and can thus
be neglected. This approximation is significantly better than the alternate approximation

$$ Q(t_i + \Delta) \approx Q(t_i^+) + \Delta AQ(t_i^+) + \Delta Q(t_i^+)A^* + \Delta F_cF_c^* \tag{7.5-13} $$

obtained by inspection from Equation (7.5-5).
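The relative accuracy of the two covariance approximations can be illustrated on a scalar system, where the integral in the exact solution has the closed form $F_c^2(e^{2A\Delta} - 1)/(2A)$. The sketch below uses illustrative numbers chosen for the example; the size of the gap between the two approximations depends on those numbers, in particular on how large the starting covariance is relative to the process noise.

```python
import math

a, delta = -1.0, 0.1       # scalar A and sample interval (illustrative)
fc = 1.0                   # continuous-time process noise gain
q0 = 4.0                   # covariance at the start of the interval

phi = math.exp(a * delta)

# Exact scalar solution of Equation (7.5-5) over one interval
q_exact = phi * q0 * phi + fc ** 2 * (math.exp(2.0 * a * delta) - 1.0) / (2.0 * a)

# Equation (7.5-12): exact transition matrix, first-order noise term
q_transition = phi * q0 * phi + delta * fc ** 2

# Equation (7.5-13): first-order expansion of the whole differential equation
q_euler = q0 + delta * a * q0 + delta * q0 * a + delta * fc ** 2

print("exact:                ", q_exact)
print("Equation (7.5-12):    ", q_transition, "error", abs(q_transition - q_exact))
print("Equation (7.5-13):    ", q_euler, "error", abs(q_euler - q_exact))
```

In this example the error of the transition-matrix form comes only from the noise term, while the first-order form also misrepresents the propagation of the existing covariance.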

The above discussion has concentrated on propagating the estimate between measurements, i.e., the time
update. It remains only to discuss the measurement update for the discrete measurements. We have $\tilde{x}(t_i)$
and $Q(t_i)$ at some time point. We need to use these and the measured data at the time point to obtain
$\hat{x}(t_i)$ and $P(t_i)$. This is identical to the discrete-time measurement update problem solved by
Equations (7.2-16) and (7.2-17). We can also use the alternate forms discussed in Section 7.2.4.

To start the filter, we are given the a priori mean $\tilde{x}(t_1)$ and covariance $Q(t_1)$ of the state at
time $t_1$. Use Equations (7.2-16) and (7.2-17) (or alternates) to obtain $\hat{x}(t_1)$ and $P(t_1)$. Integrate
Equations (7.5-2) to (7.5-5) from $t_1$ to $t_2$ by some means (most likely Equations (7.5-10) and (7.5-12)) to
obtain $\tilde{x}(t_2)$ and $Q(t_2)$. This completes one time step of the filter; processing of subsequent time
points uses the same procedure.

The solution for the steady-state form of the discrete/continuous filter follows immediately from that of
the discrete-time filter, because the equations for the covariance updates are identical for the two filters
with the appropriate substitution of $F$ in terms of $F_c$. Theorem (7.3-1) therefore applies.

We can summarize this section by saying that there is a continuous/discrete-time filter derived from
appropriate results in the pure discrete- and pure continuous-time analyses. If the input $u$ holds its value
between samples, then the form of the continuous/discrete filter is identical to that of the pure discrete-time
filter with an appropriate substitution for the equivalent discrete-time process noise covariance. For more
realistic behavior of $u$, we must adopt approximations if the analysis is done on a digital computer. It is
also possible to view the continuous-time filter equations as giving reasonable approximations to the
continuous/discrete-time filter in some situations. In any event, we will not go wrong as long as we recognize
that we can write the exact filter equations for the continuous/discrete-time system and that we must consider
any other equations used as approximations to the exact solution. With this frame of mind we can objectively
evaluate the adequacy of the approximations involved for specific problems.


7.6  SMOOTHING 

The derivation of optimal smoothers draws heavily on the derivation of the Kalman filter. Starting from
the filter results, only a single step is required to compute the smoothed estimates. In this section, we
briefly derive the fixed-interval smoother for discrete-time linear systems with additive Gaussian noise.
Fixed-interval smoothers are the most widely used. The same general principles apply to deriving fixed-point
and fixed-lag smoothers. See Meditch (1969) for derivations and equations for fixed-point and fixed-lag
smoothers and for continuous-time forms.

There are alternate computational forms for the fixed-interval smoother; these forms give mathematically
equivalent results. We will not discuss computational advantages of the various forms. See Bierman (1977)
and Bach and Wingrove (1983) for alternate forms and discussions of their advantages.

Consider the fixed-interval smoothing problem on an interval with $N$ time points. As in the filter
derivation, we will concentrate on two time points at a time in order to get a recursive form. It is
straightforward to write an explicit formulation for the smoother, like the explicit filter form of Section 7.1,
but such a form is impractical.

In the nature of recursive derivations, assume that we have previously computed $\bar{x}_{i+1}$, the smoothed
estimate of $x_{i+1}$, and $S_{i+1}$, the covariance of $x_{i+1}$ given $Z_N$. We seek to derive an expression for
$\bar{x}_i$ and $S_i$. Note that this recursion runs backwards in time instead of forwards; a forward recursion
will not work, for reasons which we will see later.

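The smoothed estimates, $\bar{x}_i$ and $\bar{x}_{i+1}$, are defined by

$$ \begin{bmatrix} \bar{x}_i \\ \bar{x}_{i+1} \end{bmatrix} = E\left\{ \begin{bmatrix} x_i \\ x_{i+1} \end{bmatrix} \Big|\, Z_N \right\} \tag{7.6-1} $$

We will use the measurement partitioning ideas of Section 5.2.2, with the measurement $Z_N$ partitioned into
$Z_i$ and the remaining measurements, $\tilde{Z}_i$, which comprise $z_{i+1}, \ldots, z_N$.

From the derivation of the Kalman filter, we can write the joint distribution of $x_i$ and $x_{i+1}$
conditioned on $Z_i$. It is Gaussian with mean

$$ E\left\{ \begin{bmatrix} x_i \\ x_{i+1} \end{bmatrix} \Big|\, Z_i \right\} = \begin{bmatrix} \hat{x}_i \\ \tilde{x}_{i+1} \end{bmatrix} \tag{7.6-3} $$

and covariance

$$ \mathrm{cov}\left\{ \begin{bmatrix} x_i \\ x_{i+1} \end{bmatrix} \Big|\, Z_i \right\} = \begin{bmatrix} P_i & P_i\Phi^* \\ \Phi P_i & Q_{i+1} \end{bmatrix} \tag{7.6-4} $$

We did not previously derive the cross term in the above covariance matrix. To derive the form shown, write

$$ E\{(x_i - \hat{x}_i)(x_{i+1} - \tilde{x}_{i+1})^* | Z_i\} = E\{(x_i - \hat{x}_i)(\Phi x_i + \Psi u_i + F n_i - \Phi\hat{x}_i - \Psi u_i)^* | Z_i\} $$

$$ = E\{(x_i - \hat{x}_i)(x_i - \hat{x}_i)^* | Z_i\}\Phi^* + E\{(x_i - \hat{x}_i)(F n_i)^* | Z_i\} $$

$$ = P_i\Phi^* + 0 \tag{7.6-5} $$

For the second step of the partitioned algorithm, we consider the measurements $\tilde{Z}_i$, using
Equations (7.6-3) and (7.6-4) for the prior distribution. The measurements $\tilde{Z}_i$ can be written in the form

$$ \tilde{Z}_i = \tilde{C}_i x_{i+1} + \tilde{D}_i + \tilde{G}_i\tilde{\eta}_i \tag{7.6-6} $$

for some matrices $\tilde{C}_i$, $\tilde{D}_i$, and $\tilde{G}_i$, and some Gaussian, zero-mean, identity-covariance
noise vector $\tilde{\eta}_i$. Although we could laboriously write out expressions for the matrices in
Equation (7.6-6), this step is unnecessary; we need only know that such a form exists. The important thing about
Equation (7.6-6) is that $x_i$ does not appear in it.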

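Using Equations (7.6-3) and (7.6-4) for the prior distribution and Equation (7.6-6) for the measurement
equation, we can now obtain the joint posterior distribution of $x_i$ and $x_{i+1}$ given $Z_N$. This distribution
is Gaussian with mean and covariance given by Equations (5.1-12) and (5.1-13), substituting Equation (7.6-3) for
the prior mean, Equation (7.6-4) for $P$, $\tilde{D}_i$ for $D$, $\tilde{G}_i$ for $G$, and

$$ C = [\,0 \;|\; \tilde{C}_i\,] \tag{7.6-7} $$

By definition (Equation (7.6-1)), the mean of this distribution gives the smoothed estimates $\bar{x}_i$ and
$\bar{x}_{i+1}$. Making the substitutions into Equation (5.1-12) and expanding gives

$$ \begin{bmatrix} \bar{x}_i \\ \bar{x}_{i+1} \end{bmatrix} = \begin{bmatrix} \hat{x}_i \\ \tilde{x}_{i+1} \end{bmatrix} + \begin{bmatrix} P_i\Phi^*\tilde{C}_i^* \\ Q_{i+1}\tilde{C}_i^* \end{bmatrix} (\tilde{C}_i Q_{i+1}\tilde{C}_i^* + \tilde{G}_i\tilde{G}_i^*)^{-1}(\tilde{Z}_i - \tilde{C}_i\tilde{x}_{i+1} - \tilde{D}_i) \tag{7.6-8} $$

We can solve Equation (7.6-8) for $\bar{x}_i$ in terms of $\bar{x}_{i+1}$, which we assume to have been computed in
the previous step of the backwards recursion.

$$ \bar{x}_i = \hat{x}_i + P_i\Phi^*Q_{i+1}^{-1}(\bar{x}_{i+1} - \tilde{x}_{i+1}) \tag{7.6-9} $$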

Equation (7.6-9) is the backwards recursive form sought. Note that the equation does not depend
explicitly on the measurements or on the matrices in Equation (7.6-6). That information is all subsumed in
$\bar{x}_{i+1}$. The "initial" condition for the recursion is

$$ \bar{x}_N = \hat{x}_N \tag{7.6-10} $$

which follows directly from the definitions. We do not have a corresponding known boundary condition at the
beginning of the interval, which is why we must propagate the smoothing recursion backwards, instead of
forwards.

We can now describe the complete process of computing the smoothed state estimates for a fixed time
interval. First propagate the Kalman filter through the entire interval, saving all of the values $\hat{x}_i$,
$\tilde{x}_i$, $P_i$, and $Q_i$. Then propagate Equation (7.6-9) backwards in time, using the saved values from
the filter, and starting from the boundary condition given by Equation (7.6-10).

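We can derive a formula for the smoother covariance by substituting appropriately into Equation (5.1-13)
to get

$$ \begin{bmatrix} S_i & \cdot \\ \cdot & S_{i+1} \end{bmatrix} = \begin{bmatrix} P_i & P_i\Phi^* \\ \Phi P_i & Q_{i+1} \end{bmatrix} - \begin{bmatrix} P_i\Phi^*\tilde{C}_i^* \\ Q_{i+1}\tilde{C}_i^* \end{bmatrix} (\tilde{C}_i Q_{i+1}\tilde{C}_i^* + \tilde{G}_i\tilde{G}_i^*)^{-1} [\,\tilde{C}_i\Phi P_i \;|\; \tilde{C}_i Q_{i+1}\,] \tag{7.6-11} $$

(The off-diagonal blocks are not relevant to this derivation.) We can solve Equation (7.6-11) for $S_i$ in
terms of $S_{i+1}$, giving

$$ S_i = P_i - P_i\Phi^*Q_{i+1}^{-1}(Q_{i+1} - S_{i+1})Q_{i+1}^{-1}\Phi P_i \tag{7.6-12} $$

This gives us a backwards recursion for the smoother covariance. The "initial" condition

$$ S_N = P_N \tag{7.6-13} $$

follows from the definitions. Note that, as in the recursion for the smoothed estimate, the measurements and
the measurement equation matrices have dropped out of Equation (7.6-12). All the necessary data about the
future process is subsumed in $S_{i+1}$. Note also that it is not necessary to compute the smoother covariance
$S_i$ in order to compute the smoothed estimates.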
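The forward filter pass followed by the backward smoothing pass can be sketched for a scalar system as follows. This is a minimal illustration under stated assumptions: there is no control input, the recursion used is the standard form $\bar{x}_i = \hat{x}_i + P_i\Phi^*Q_{i+1}^{-1}(\bar{x}_{i+1} - \tilde{x}_{i+1})$ with the matching covariance recursion, the predicted covariance is nonzero so that it can be inverted, and all names and numbers are made up for the example.

```python
import numpy as np

def filter_and_smooth(z, phi, c, f, g, x0, q0):
    """Scalar Kalman filter (forward) and fixed-interval smoother (backward)."""
    n = len(z)
    x_pred = np.empty(n)   # predicted estimates
    q = np.empty(n)        # predicted covariances Q_i
    x_hat = np.empty(n)    # filtered estimates
    p = np.empty(n)        # filtered covariances P_i
    xp, qp = x0, q0
    for i in range(n):
        x_pred[i], q[i] = xp, qp
        k = qp * c / (c * qp * c + g * g)          # Kalman gain
        x_hat[i] = xp + k * (z[i] - c * xp)        # measurement update
        p[i] = qp - k * c * qp
        xp = phi * x_hat[i]                        # time update
        qp = phi * p[i] * phi + f * f
    x_s = np.empty(n)      # smoothed estimates
    s = np.empty(n)        # smoother covariances S_i
    x_s[-1], s[-1] = x_hat[-1], p[-1]              # boundary conditions at the last point
    for i in range(n - 2, -1, -1):
        gain = p[i] * phi / q[i + 1]               # P_i Phi* Q_{i+1}^{-1}
        x_s[i] = x_hat[i] + gain * (x_s[i + 1] - x_pred[i + 1])
        s[i] = p[i] - gain * (q[i + 1] - s[i + 1]) * gain
    return x_hat, p, x_s, s

# Simulated data for an illustrative stable scalar system
rng = np.random.default_rng(1)
phi, c, f, g = 0.95, 1.0, 0.3, 1.0
x_true = np.empty(200)
x = rng.standard_normal()
for i in range(200):
    x_true[i] = x
    x = phi * x + f * rng.standard_normal()
z = c * x_true + g * rng.standard_normal(200)

x_hat, p, x_s, s = filter_and_smooth(z, phi, c, f, g, x0=0.0, q0=1.0)
print("filter covariance at midpoint:  ", p[100])
print("smoother covariance at midpoint:", s[100])
```

The smoother covariance should never exceed the filter covariance, and the two coincide at the final time point, where no future data are available.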


7.7  NONLINEAR  SYSTEMS  AND  NON-GAUSSIAN  NOISE 

Optimal state estimation for nonlinear dynamic systems is substantially more difficult than for linear
systems. Only in rare special cases are there tractable exact solutions for optimal filters for nonlinear
systems. The same comments apply to systems with non-Gaussian noise.

Practical implementations of filters for nonlinear systems invariably involve approximations. The most
common approximations are based on linearizing the system and using the optimal filter for the linearized
system. Similarly, non-Gaussian noise is approximated, to first order, by Gaussian noise with the same mean
and covariance.

Consider a nonlinear dynamic system with additive noise

$$\dot{x}(t) = f(x(t), u(t)) + n(t) \tag{7.7-1a}$$

$$z(t_i) = g(x(t_i), u(t_i)) + \eta_i \tag{7.7-1b}$$

Assume that we have some nominal estimate, $x_n(t)$, of the state time history. Then the linearization of
Equation (7.7-1) about this nominal trajectory is

$$\dot{x}(t) = A(t)x(t) + B(t)u(t) + f_n(t) + n(t) \tag{7.7-2a}$$

$$z(t_i) = C(t_i)x(t_i) + D(t_i)u(t_i) + g_n(t_i) + \eta_i \tag{7.7-2b}$$

where

$$A(t) = \nabla_x f(x_n(t), u(t)) \tag{7.7-3a}$$

$$B(t) = \nabla_u f(x_n(t), u(t)) \tag{7.7-3b}$$

$$C(t) = \nabla_x g(x_n(t), u(t)) \tag{7.7-3c}$$

$$D(t) = \nabla_u g(x_n(t), u(t)) \tag{7.7-3d}$$

$$f_n(t) = f(x_n(t), u(t)) \tag{7.7-4a}$$

$$g_n(t) = g(x_n(t), u(t)) \tag{7.7-4b}$$

For a given nominal trajectory, Equations (7.7-2) to (7.7-4) define a time-varying linear system. The Kalman
filter/smoother algorithms derived in previous sections of this chapter give optimal state estimates for this
linearized system.
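
As an illustration of Equations (7.7-3a) and (7.7-3b), the matrices $A$ and $B$ at one instant can be approximated numerically when hand differentiation of $f$ is inconvenient. The sketch below is only an example under stated assumptions: the pendulum-like function `f`, the helper `jacobian`, and the nominal values are all hypothetical and chosen for illustration.

```python
import numpy as np

def f(x, u):
    # Hypothetical pendulum-like dynamics, used only as an example.
    return np.array([x[1], -np.sin(x[0]) + u[0]])

def jacobian(fun, v, eps=1e-6):
    """Central-difference Jacobian of fun with respect to the vector v."""
    v = np.asarray(v, dtype=float)
    f0 = fun(v)
    J = np.zeros((f0.size, v.size))
    for j in range(v.size):
        dv = np.zeros_like(v)
        dv[j] = eps
        J[:, j] = (fun(v + dv) - fun(v - dv)) / (2 * eps)
    return J

x_n = np.array([0.3, 0.0])   # nominal state at one instant (assumed)
u = np.array([0.1])          # known input at that instant (assumed)

A = jacobian(lambda x: f(x, u), x_n)   # gradient with respect to x, as in (7.7-3a)
B = jacobian(lambda w: f(x_n, w), u)   # gradient with respect to u, as in (7.7-3b)
```

Repeating this evaluation along the nominal trajectory gives the time-varying matrices of the linearized system.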

The filter based on this linearized system is called a linearized Kalman filter or an extended Kalman
filter (EKF). Its adequacy as an approximation to the optimal filter for the nonlinear system depends on
several factors which we will not analyze in depth. It is a reasonable supposition that if the system is
nearly linear, then the linearized Kalman filter will be a close approximation to the optimal filter for the
system. If, on the other hand, nonlinearities play a major role in defining the characteristic system
responses, the reasonableness of the linearized Kalman filter is questionable.

The above description is intended only to introduce the simplest ideas of linearized Kalman filters.
Starting from this point, there are numerous extensions, modifications, and nuances of application. Nonlinear
filtering is an area of current research. See Bach and Wingrove (1983) and Cox and Bryson (1980) for a few of
the many investigations in this field. Schweppe (1973) and Jazwinski (1970) have fairly extensive discussions
of nonlinear state estimation.


CHAPTER 8


8.0  OUTPUT  ERROR  METHOD  FOR  DYNAMIC  SYSTEMS 

In previous chapters, we have covered the static estimation problem and the estimation of the state of
dynamic systems. With this background, we can now begin to address the principal subject of this book,
estimation of the parameters of dynamic systems.

Before addressing the more difficult parameter estimation problems posed by more general system models,
we will consider the simplified case that leads to the algorithm called output error. The simplification that
leads to the output-error method is to omit the process-noise term from the state equation. For this reason,
the output-error method is often described by terms like "the no-process-noise algorithm" or "the
measurement-noise-only algorithm."

We will first discuss mixed continuous/discrete-time systems, which are most appropriate for the majority
of the practical applications. We will follow this discussion by a brief summary of any differences for pure
discrete-time systems, which are useful for some applications. The derivation and results are essentially
identical. The pure continuous-time results, although similar in expression, involve extra complications. We
have never seen an appropriate practical application of the pure continuous-time results; we therefore feel
justified in omitting them.

In mixed continuous/discrete time, the most general system model that we will seriously consider is

$$x(t_0) = x_0 \tag{8.0-1a}$$

$$\dot{x}(t) = f[x(t), u(t), \xi] \tag{8.0-1b}$$

$$z(t_i) = g[x(t_i), u(t_i), \xi] + G(\xi)\eta_i \qquad i = 1, 2, \ldots \tag{8.0-1c}$$

The measurement noise $\eta$ is assumed to be a sequence of independent Gaussian random variables with zero mean
and identity covariance. The input $u$ is assumed to be known exactly. The initial condition $x_0$ can be
treated in several ways, as discussed in Section 8.2. In general, the functions $f$ and $g$ can also be explicit
functions of $t$. We omit this from the notation for simplicity. (In any event, explicit time dependence can
be put in the notation of Equation (8.0-1) by defining an extra control equal to $t$.)

The corresponding nonlinear model for pure discrete-time systems is

$$x(t_0) = x_0 \tag{8.0-2a}$$

$$x(t_{i+1}) = f[x(t_i), u(t_i), \xi] \qquad i = 0, 1, \ldots \tag{8.0-2b}$$

$$z(t_i) = g[x(t_i), u(t_i), \xi] + G(\xi)\eta_i \qquad i = 1, 2, \ldots \tag{8.0-2c}$$

The assumptions are the same as in the continuous/discrete case.

Although the output-error method applies to nonlinear systems, we will give special attention to the
treatment of linear systems. The linear form of Equation (8.0-1) is

$$x(t_0) = x_0 \tag{8.0-3a}$$

$$\dot{x}(t) = Ax(t) + Bu(t) \tag{8.0-3b}$$

$$z(t_i) = Cx(t_i) + Du(t_i) + G\eta_i \qquad i = 1, 2, \ldots \tag{8.0-3c}$$

The matrices $A$, $B$, $C$, $D$, and $G$ are functions of $\xi$; we will not complicate the notation by explicitly
indicating this relationship. Of course, $x$ and $z$ are also functions of $\xi$ through their dependence on the
system matrices.

In general, the matrices $A$, $B$, $C$, $D$, and $G$ can also be functions of time. For notational simplicity, we
have not explicitly indicated this dependence. In several places, time invariance of the matrices introduces
significant computational savings. The text will indicate such situations. Note that $\xi$ cannot be a function
of time. Problems with time-varying $\xi$ must be reformulated with a time-invariant $\xi$ in order for the
techniques of this chapter to be applicable.

The linear form of Equation (8.0-2) is

$$x(t_0) = x_0 \tag{8.0-4a}$$

$$x(t_{i+1}) = \Phi x(t_i) + \Psi u(t_i) \qquad i = 0, 1, \ldots \tag{8.0-4b}$$

$$z(t_i) = Cx(t_i) + Du(t_i) + G\eta_i \qquad i = 1, 2, \ldots \tag{8.0-4c}$$

The transition matrices $\Phi$ and $\Psi$ are functions of $\xi$, and possibly of time.


For any of the model forms, a prior distribution for $\xi$ may or may not exist, depending on the particular
application. When there is no prior distribution, or when you desire to obtain an estimate independent of the
prior distribution, use a maximum-likelihood estimator. When a prior distribution is considered, MAP
estimators are appropriate. For the parameter estimation problem, a posteriori expected-value estimates and
Bayesian optimal estimates are impractical to compute, except in special cases. The posterior distribution
of $\xi$ is not, in general, symmetric; thus the a posteriori expected value need not equal the MAP estimate.


8.1  DERIVATION 

The basic method of derivation for the output-error method is to reduce the problem to the static form of
Chapter 5. We will see that the dynamic system makes the models fairly complicated, but not different in any
essential way from those of Chapter 5. We first consider the case where $G$ and the initial condition are
assumed to be known.


Choose an arbitrary value of $\xi$. Given the initial condition $x_0$ and a specified input time-history $u$,
the state equation (8.0-1b) can be solved to give the state as a function of time. We assume that $f$ is
sufficiently smooth to guarantee the existence and uniqueness of the solution (Brauer and Nohel, 1969). For
complicated $f$ functions, the solution may be difficult or impossible to express in closed form, but that
aspect is irrelevant to the theory. (The practical implication is that the solution will be obtained using
numerical approximation methods.) The important thing to note is that, because of the elimination of the
process noise, the solution is deterministic.


For a specified input $u$, the system state is thus a deterministic function of $\xi$ and time. For
consistency with the notation of the filter-error method discussed later, denote this function by $\tilde{x}_\xi(t)$.
The $\xi$ subscript emphasizes the dependence on $\xi$. The dependence on $u$ is not relevant to the current
discussion, so the notation ignores this dependence for simplicity. Assuming known $G$, Equation (8.0-1c) then
becomes

$$z(t_i) = g[\tilde{x}_\xi(t_i), u(t_i), \xi] + G\eta_i \qquad i = 1, 2, \ldots \tag{8.1-1}$$

Equation (8.1-1) is in the form of Equation (5.4-1); it is a static nonlinear model with additive noise. There
are multiple experiments, one at each $t_i$. The estimators of Section 5.4 apply directly. The assumptions
adopted have allowed us to solve the system dynamics, leaving an essentially static problem.


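The MAP estimate is obtained by minimizing Equation (5.4-9). In the notation of this chapter, this
equation becomes

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)]^* (GG^*)^{-1} [z(t_i) - \tilde{z}_\xi(t_i)] + \frac{1}{2} (\xi - m_\xi)^* P^{-1} (\xi - m_\xi) \tag{8.1-2}$$

where

$$\tilde{x}_\xi(t_0) = x_0 \tag{8.1-3a}$$

$$\dot{\tilde{x}}_\xi(t) = f[\tilde{x}_\xi(t), u(t), \xi] \tag{8.1-3b}$$

$$\tilde{z}_\xi(t_i) = g[\tilde{x}_\xi(t_i), u(t_i), \xi] \qquad i = 1, 2, \ldots \tag{8.1-3c}$$

The quantities $m_\xi$ and $P$ are the mean and covariance of the prior distribution of $\xi$, as in Chapter 5. For
the MLE estimator, omit the last term of Equation (8.1-2), giving

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)]^* (GG^*)^{-1} [z(t_i) - \tilde{z}_\xi(t_i)] \tag{8.1-4}$$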

Equation (8.1-4) is a quadratic form in the difference between $z$, the measured response (output), and $\tilde{z}_\xi$, the
response computed from the deterministic part of the system model. This motivates the name "output error."
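
To make the structure of Equation (8.1-4) concrete, the following sketch evaluates the cost for a deliberately trivial case. The scalar discrete-time model, the function names `simulate` and `cost`, and the scalar variance `gg` standing in for $GG^*$ are assumptions made for illustration only.

```python
import numpy as np

def simulate(xi, u, x0=0.0):
    """Deterministic response of x[i+1] = a*x[i] + b*u[i], with output z~ = x."""
    a, b = xi
    x = x0
    z_model = np.zeros(len(u))
    for i in range(len(u)):
        x = a * x + b * u[i]      # state at the next time point
        z_model[i] = x            # computed response at that point
    return z_model

def cost(xi, z, u, gg=1.0):
    """Output-error cost: half the weighted sum of squared response errors."""
    r = z - simulate(xi, u)       # measured minus computed response
    return 0.5 * np.sum(r * r) / gg
```

Because there is no process noise, `simulate` is an ordinary deterministic function of the parameter vector, so `cost` can be handed to any general-purpose minimizer.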

The minimization of Equation (8.1-4) is an intuitively plausible estimator defensible even without statistical
derivation. The minimizing value of $\xi$ gives the system model that best approximates (in a least-squares
sense) the actual system response to the test input. Although this does not necessarily guarantee that the
model response and the system response will be similar for other test inputs, the minimizing value of $\xi$ is
certainly a plausible estimate.

The estimates that result from minimizing Equation (8.1-4) are sometimes called "least squares" estimates,
in reference to the quadratic form of the equation. We prefer to avoid the use of this terminology because it
is potentially confusing. Many of the estimators applicable to dynamic systems have a least-squares form, so
the term is not definitive. Furthermore, the term "least squares" is most often applied to Equation (8.1-4)
to contrast it from other forms labeled "maximum likelihood" (typically the estimators of Section 8.4, which
apply to unknown $G$, or the estimators of Chapter 9, which account for process noise). This contrast is
misleading because Equation (8.1-4) describes a completely rigorous, maximum-likelihood estimator for the
problem as posed. The differences between Equation (8.1-4) and the estimators of Section 8.4 and Chapter 9 are
differences in the problem statement, not differences in the statistical principles used for solution.

To derive the output-error method for pure discrete-time systems, substitute the discrete-time
Equation (8.0-2b) in place of Equation (8.0-1b). The derivation and the result are unchanged except that
Equation (8.1-3b) becomes

$$\tilde{x}_\xi(t_{i+1}) = f[\tilde{x}_\xi(t_i), u(t_i), \xi] \qquad i = 0, 1, \ldots \tag{8.1-5}$$


8.2  INITIAL  CONDITIONS

The above derivation of the output-error method assumed that the initial condition was known exactly.
This assumption is seldom strictly true, except when using forms where the initial condition is zero by
definition.

The initial condition is typically based on imperfectly measured data. This characteristic suggests
treating the initial condition as a random variable with some mean and covariance. Such treatment, however, is
incompatible with the output-error method. The output-error method is predicated on a deterministic solution
of the state equation. Treatment of a random initial condition requires the more complex filter-error method
discussed later.

If the system is stable, then initial condition effects decay to a negligible level in a finite time.
If this decay is sufficiently fast and the error in the initial condition is sufficiently small, the initial
condition error will have negligible effect on the system response and can be ignored.

If the errors in the initial condition are too large to justify neglecting them, there are several ways to
resolve the problem without sacrificing the relative simplicity of the output-error method. One way is to
simply improve the initial-condition values. This is sometimes trivially easy if the initial-condition value
is computed from the measurement at the first time point of the maneuver (a common practice): change the start
time by one sample to avoid an obvious wild point, average the first few data points, or draw a fairing through
the noise and use the faired value.

When these methods are inapplicable or insufficient, we can include the initial condition in the list of
unknown parameters to estimate. The initial condition is then a deterministic function of $\xi$. The solution
of the state equation is thus still a deterministic function of $\xi$ and time, as required for the output-error
method. The equations of Section 8.1 still apply, provided that we substitute

$$\tilde{x}_\xi(t_0) = x_0(\xi) \tag{8.2-1}$$

for Equation (8.1-3a).

It is easy to show that the initial-condition estimates have poor asymptotic properties as the time
interval increases. The initial-condition information is all near the beginning of the maneuver, and
increasing the time interval does not add to this information. Asymptotically, we can and should ignore
initial conditions for stable systems. This is one case where asymptotic results are misleading. For real
data with finite time intervals we should always carefully consider initial conditions. Thus, we avoid making
the mistake of one published paper (which we will leave anonymous) which blithely set the model initial
condition to zero in spite of clearly nonzero data. It is not clear whether this was a simple oversight or
whether the author thought that asymptotic results justified the practice; in any event, the resulting errors
were so egregious as to render the results worthless (except as an abject lesson).

8.3  COMPUTATIONS 

Equations (8.1-2) and (8.1-3) define the cost function that must be minimized to obtain the MAP estimates
(or, in the special case that $P^{-1}$ is zero, the MLE estimates). This is a fairly complicated function of $\xi$.
Therefore we must use an iterative minimization scheme.

It is easy to become overwhelmed by the apparent complexity of $J$ as a function of $\xi$; $\tilde{z}_\xi(t_i)$ is itself
a complicated function of $\xi$, involving the solution of a differential equation. To get $J$ as a function of
$\xi$ we must substitute this function for $\tilde{z}_\xi(t_i)$ in Equation (8.1-2). You might give up at the thought of
evaluating first and second gradients of this function, as required by most iterative optimization methods.

The complexity, however, is only apparent. It is crucial to recognize that we do not need to develop a
closed-form expression, the development of which would be difficult at best. We are only required to develop
a workable procedure for computing the result.

To evaluate the gradients of $J$, we need only proceed one step at a time; each step is quite simple,
involving nothing more complicated than chain-rule differentiation. This step-by-step process follows the
advice from Alice in Wonderland:

The  White  Rabbit  put  on  his  spectacles.  "Where  shall  I  begin,  please  your 
Majesty?"  he  asked. 

"Begin  at  the  beginning,"  the  King  said,  very  gravely,  "and  go  on  till  you 
come  to  the  end:  then  stop." 

8.3.1  Gauss-Newton  Method 


The cost function is in the form of a sum of squares, which makes Gauss-Newton the preferred optimization
algorithm. Sections 2.5.2 and 5.4.3 discussed the Gauss-Newton algorithm. To gather together all the
important equations, we repeat the basic equations of the Gauss-Newton algorithm in the notation of this
chapter. Gauss-Newton is a quasi-Newton algorithm. The full Newton-Raphson algorithm is

$$\xi_{L+1} = \xi_L - [\nabla_\xi^2 J(\xi_L)]^{-1} [\nabla_\xi^* J(\xi_L)] \tag{8.3-1}$$

The first gradient is

$$\nabla_\xi J(\xi) = -\sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)]^* (GG^*)^{-1} [\nabla_\xi \tilde{z}_\xi(t_i)] + (\xi - m_\xi)^* P^{-1} \tag{8.3-2}$$

For the Gauss-Newton algorithm, we approximate the second gradient by

$$\nabla_\xi^2 J(\xi) \cong \sum_{i=1}^{N} [\nabla_\xi \tilde{z}_\xi(t_i)]^* (GG^*)^{-1} [\nabla_\xi \tilde{z}_\xi(t_i)] + P^{-1} \tag{8.3-3}$$

which corresponds to Equation (2.5-11) applied to the cost function of this chapter. Equations (8.3-1)
through (8.3-3) are the same, whether the system is in pure discrete time or mixed continuous/discrete time.
The only quantities in these equations requiring any discussion are $\tilde{z}_\xi(t_i)$ and $\nabla_\xi \tilde{z}_\xi(t_i)$.
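
A small numerical example may help fix the roles of Equations (8.3-1) through (8.3-3). The sketch below is illustrative only: it assumes a one-parameter system $\dot{x} = -\xi x$ with $x(0) = 1$ and output $z = x$, whose response $e^{-\xi t}$ and response gradient are known in closed form, noise-free synthetic data, a scalar `gg` in place of $GG^*$, and no prior term ($P^{-1} = 0$).

```python
import numpy as np

t = np.linspace(0.1, 3.0, 30)
xi_true = 1.5
z = np.exp(-xi_true * t)               # noise-free synthetic measurements

def response(xi):
    return np.exp(-xi * t)             # computed response z~(t_i)

def response_gradient(xi):
    return -t * np.exp(-xi * t)        # gradient of z~(t_i) with respect to xi

def gauss_newton(xi, gg=1.0, iterations=10):
    for _ in range(iterations):
        r = z - response(xi)
        dz = response_gradient(xi)
        grad = -np.sum(r * dz) / gg    # first gradient, as in (8.3-2)
        hess = np.sum(dz * dz) / gg    # Gauss-Newton second gradient, as in (8.3-3)
        xi = xi - grad / hess          # Newton-type update, as in (8.3-1)
    return xi

xi_hat = gauss_newton(xi=1.0)
```

Only the response and its gradient depend on the particular system; the update itself is the same for every model form.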

8.3.2  System  Response 

The methods for computation of the system response depend on whether the system is pure discrete time
or mixed continuous/discrete time. The choice of method is also influenced by whether the system is linear
or nonlinear.


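Computation of the response of discrete-time systems is simply a matter of plugging into the equations.
The general equations for a nonlinear system are

$$\tilde{x}_\xi(t_0) = x_0(\xi) \tag{8.3-4a}$$

$$\tilde{x}_\xi(t_{i+1}) = f[\tilde{x}_\xi(t_i), u(t_i), \xi] \qquad i = 0, 1, \ldots \tag{8.3-4b}$$

$$\tilde{z}_\xi(t_i) = g[\tilde{x}_\xi(t_i), u(t_i), \xi] \qquad i = 1, 2, \ldots \tag{8.3-4c}$$

The more specific equations for a linear discrete-time system are

$$\tilde{x}_\xi(t_0) = x_0(\xi) \tag{8.3-5a}$$

$$\tilde{x}_\xi(t_{i+1}) = \Phi \tilde{x}_\xi(t_i) + \Psi u(t_i) \qquad i = 0, 1, \ldots \tag{8.3-5b}$$

$$\tilde{z}_\xi(t_i) = C \tilde{x}_\xi(t_i) + D u(t_i) \qquad i = 1, 2, \ldots \tag{8.3-5c}$$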

For mixed continuous/discrete-time systems, numerical methods for approximate integration are required.
You can use any of numerous numerical methods, but the utility of the more complicated methods is often
limited by the available data. It makes little sense to use a high-order method to integrate the system
equations between the time points where the input is measured. The errors implicit in interpolating the input
measurements are probably larger than the errors in the integration method. For most purposes, a second-order
Runge-Kutta algorithm is probably an appropriate choice:

$$\tilde{x}_\xi(t_0) = x_0(\xi) \tag{8.3-6a}$$

$$\bar{x}_\xi(t_{i+1}) = \tilde{x}_\xi(t_i) + (t_{i+1} - t_i) f[\tilde{x}_\xi(t_i), u(t_i), \xi] \tag{8.3-6b}$$

$$\tilde{x}_\xi(t_{i+1}) = \tilde{x}_\xi(t_i) + (t_{i+1} - t_i) \frac{1}{2} \{ f[\tilde{x}_\xi(t_i), u(t_i), \xi] + f[\bar{x}_\xi(t_{i+1}), u(t_{i+1}), \xi] \} \tag{8.3-6c}$$

$$\tilde{z}_\xi(t_i) = g[\tilde{x}_\xi(t_i), u(t_i), \xi] \tag{8.3-6d}$$
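
A second-order Runge-Kutta propagation of this kind takes only a few lines of code. In the sketch below, the function name `rk2_response` and its argument list are hypothetical; `f` and `g` are user-supplied functions of the state, the input, and the parameter vector, and the input is assumed to be sampled at the same time points as the output.

```python
import numpy as np

def rk2_response(f, g, x0, u, t, xi):
    """Propagate the state with a predictor step and a trapezoidal corrector."""
    x = np.asarray(x0, dtype=float)
    z_model = []
    for i in range(len(t) - 1):
        h = t[i + 1] - t[i]
        k1 = f(x, u[i], xi)
        x_pred = x + h * k1                    # predictor (Euler) step
        k2 = f(x_pred, u[i + 1], xi)
        x = x + 0.5 * h * (k1 + k2)            # corrector: average of the two slopes
        z_model.append(g(x, u[i + 1], xi))     # computed response at t[i+1]
    return np.array(z_model)
```

For example, with `f = lambda x, u, xi: -xi * x`, `g = lambda x, u, xi: x`, a unit initial condition, and 100 steps over one second, the final response agrees with $e^{-\xi}$ to within the expected second-order error.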


For linear systems, a transition matrix method is more accurate and efficient than Equation (8.3-6).

$$\tilde{x}_\xi(t_0) = x_0(\xi) \tag{8.3-7a}$$

$$\tilde{x}_\xi(t_{i+1}) = \Phi \tilde{x}_\xi(t_i) + \Psi \frac{1}{2} [u(t_i) + u(t_{i+1})] \qquad i = 0, 1, \ldots \tag{8.3-7b}$$

$$\tilde{z}_\xi(t_i) = C \tilde{x}_\xi(t_i) + D u(t_i) \qquad i = 1, 2, \ldots \tag{8.3-7c}$$

$$\Phi = \exp[A(t_{i+1} - t_i)] \tag{8.3-8}$$

$$\Psi = \int_{t_i}^{t_{i+1}} \exp[A(t_{i+1} - \tau)] \, d\tau \, B \tag{8.3-9}$$

Section 7.5 discusses the form of Equation (8.3-7b). Moler and Van Loan (1978) describe several ways of
numerically evaluating Equations (8.3-8) and (8.3-9). In this application, because $t_{i+1} - t_i$ is small
compared to the system natural periods, simple series expansion works well.

$$\Phi \cong I + A\Delta + \frac{(A\Delta)^2}{2!} + \ldots \tag{8.3-10}$$

$$\Psi \cong \Delta \left[ I + \frac{A\Delta}{2!} + \frac{(A\Delta)^2}{3!} + \ldots \right] B \tag{8.3-11}$$

where

$$\Delta = t_{i+1} - t_i \tag{8.3-12}$$
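
The series evaluation of the transition matrices can be sketched as follows. This is an illustrative fragment, not production code: the function name `transition_matrices` and the fixed number of terms are assumptions, and a careful implementation would choose the truncation from the size of $A\Delta$.

```python
import numpy as np

def transition_matrices(A, B, dt, terms=10):
    """Truncated power series for Phi = exp(A dt) and Psi = integral of exp(A s) ds B."""
    n = A.shape[0]
    Phi = np.eye(n)
    S = np.eye(n)                  # accumulates I + A dt/2! + (A dt)^2/3! + ...
    term = np.eye(n)
    for k in range(1, terms):
        term = term @ (A * dt) / k     # (A dt)^k / k!
        Phi = Phi + term
        S = S + term / (k + 1)         # (A dt)^k / (k+1)!
    Psi = dt * S @ B
    return Phi, Psi
```

Both series share the same powers of $A\Delta$, so the two matrices are obtained in a single loop.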


8.3.3  Finite  Difference  Response  Gradient 

It remains to discuss the computation of $\nabla_\xi \tilde{z}_\xi(t_i)$, the gradient of the system response. There are two
basic methods for evaluating this gradient: finite-difference differentiation and analytic differentiation.
This section discusses the finite-difference approach, and the next section discusses the analytic approach.

Finite-difference differentiation is applicable to any model form. The method is easy to describe and
equally easy to code. Because it is easy to code, finite-difference differentiation is appropriate for
programs where quick results are needed or the production workload is small enough that saving program
development time is more important than improving program efficiency. Because it applies with equal ease to
all model forms, finite-difference differentiation is also appropriate for programs that must handle nonlinear
models, for which analytic differentiation is numerically complicated (Jategaonkar and Plaetschke, 1983).


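To use finite-difference differentiation, perturb the first element of the $\xi$ vector by some small amount
$\Delta\xi^{(1)}$. Recompute the system response using this perturbed $\xi$ vector, obtaining the perturbed system response
$\tilde{z}_p$. The partial derivative of the response with respect to $\xi^{(1)}$ is then approximately

$$\frac{\partial \tilde{z}_\xi(t_i)}{\partial \xi^{(1)}} \cong \frac{\tilde{z}_p(t_i) - \tilde{z}_\xi(t_i)}{\Delta\xi^{(1)}} \tag{8.3-13}$$

Repeat this process, perturbing each element of $\xi$ in turn, to approximate the partial derivatives with
respect to each element of $\xi$. The finite-difference gradient is then the concatenation of the partial
derivatives.

$$\nabla_\xi \tilde{z}_\xi(t_i) = \left[ \frac{\partial \tilde{z}_\xi(t_i)}{\partial \xi^{(1)}} \;\; \frac{\partial \tilde{z}_\xi(t_i)}{\partial \xi^{(2)}} \;\; \ldots \right] \tag{8.3-14}$$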
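
The procedure translates almost directly into code. In the sketch below, `response` is any user-supplied function returning the computed response (stacked over all time points) for a given parameter vector; the function name `fd_response_gradient`, the 1% default step, and the `typical` magnitudes are illustrative assumptions.

```python
import numpy as np

def fd_response_gradient(response, xi, rel_step=0.01, typical=None):
    """One-sided finite-difference gradient of the response with respect to xi."""
    xi = np.asarray(xi, dtype=float)
    typical = np.ones_like(xi) if typical is None else np.asarray(typical, float)
    z0 = response(xi)                          # unperturbed response
    grad = np.zeros((z0.size, xi.size))
    for j in range(xi.size):
        d = rel_step * typical[j]              # perturbation size for element j
        xi_p = xi.copy()
        xi_p[j] += d
        grad[:, j] = (response(xi_p) - z0) / d # one column per parameter
    return grad
```

Each parameter costs one additional evaluation of the system response, which is the main expense of the method.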


Selection of the size of the perturbations requires some thought. If the perturbation is too large,
Equation (8.3-13) becomes a poor approximation of the partial derivative. If the perturbation is too small,
roundoff errors become a problem.


Some people have reported excellent results using simple perturbation-size rules such as setting the
perturbation magnitude at 1% of a typical expected magnitude of the corresponding $\xi$ element (assuming that
you understand the problem well enough to be able to establish such typical magnitudes). You could
alternatively consider percentages of the current iteration estimates (with some special provision for
handling zero or essentially zero estimates). Another reasonable rule, after the first iteration, would be to
use percentages of the diagonal elements of the second gradient, raised to the -1/2 power. As a final resort
(it takes more computer time and is more complex), you could try several perturbation sizes, using the results
to gauge the degree of nonlinearity and roundoff error, and adaptively selecting the best perturbation size.

Due to our limited experience with the finite-difference approach, we defer making specific
recommendations on perturbation sizes, but offer the opinion that the problem is amenable to reasonable
solution. A little experimentation should suffice to establish an adequate perturbation-size rule for a
specific class of problems. Note that the higher the precision of your computer, the more margin you have
between the boundaries of linearity problems and roundoff problems. Those of us with 60- and 64-bit computers
(or 32-bit computers in double precision) seldom have serious roundoff problems and can use simple
perturbation-size rules with impunity. If you try to get by with single precision on a 32-bit computer,
careful perturbation-size selection will be more important.


8.3.4  Analytic  Response  Gradient 

The other approach to computing the gradient of the system response is to analytically differentiate the
system equations. For linear systems, this approach is sometimes far more efficient than finite-difference
differentiation. For nonlinear systems, analytic differentiation is impractically clumsy (partially because
you have to redo it for each new nonlinear model form). We will, therefore, restrict our discussion of
analytic differentiation to linear systems.

We first consider pure discrete-time linear systems in the form of Equation (8.3-5). It is crucial to
recall that we do not need a closed form for the gradient; we only need a method for computing it. A
closed-form expression would be formidable, unlike the following equation, which is the almost embarrassingly
obvious gradient of Equation (8.3-5), obtained by using nothing more complicated than the chain rule:

$$\nabla_\xi \tilde{x}_\xi(t_0) = \nabla_\xi x_0(\xi) \tag{8.3-13a}$$

$$\nabla_\xi \tilde{x}_\xi(t_{i+1}) = \Phi(\nabla_\xi \tilde{x}_\xi(t_i)) + (\nabla_\xi \Phi)\tilde{x}_\xi(t_i) + (\nabla_\xi \Psi)u(t_i) \qquad i = 0, 1, \ldots \tag{8.3-13b}$$

$$\nabla_\xi \tilde{z}_\xi(t_i) = C(\nabla_\xi \tilde{x}_\xi(t_i)) + (\nabla_\xi C)\tilde{x}_\xi(t_i) + (\nabla_\xi D)u(t_i) \qquad i = 1, 2, \ldots \tag{8.3-13c}$$

Equation (8.3-13b) gives a recursive formula for $\nabla_\xi \tilde{x}_\xi(t_i)$, with Equation (8.3-13a) as the initial condition.
Equation (8.3-13c) expresses $\nabla_\xi \tilde{z}_\xi(t_i)$ in terms of the solution of Equation (8.3-13b).

The quantities $\nabla_\xi \Phi$, $\nabla_\xi \Psi$, $\nabla_\xi C$, and $\nabla_\xi D$ in Equation (8.3-13) are gradients of matrices with respect to
the vector $\xi$. The results are vectors, the elements of which are matrices (if you are fond of buzz words,
these are third-order tensors). If this starts to sound complicated, you will be pleased to know that the
products like $(\nabla_\xi D)u(t_i)$ are ordinary matrices (and indeed sparse matrices; they have lots of zero elements).
You can compute the products directly without ever forming the vector of matrices in your program. A program
to implement Equation (8.3-13) takes fewer lines than the explanation.


We could write Equation (8.3-13) without using gradients or matrices. Simply replace $\nabla_\xi$ by $\partial/\partial\xi^{(j)}$
throughout, and then concatenate the partial derivatives to get the gradient of $\tilde{z}_\xi(t_i)$. We then have, at
worst, partial derivatives of matrices with respect to scalars; these partial derivatives are matrices. The
only difference between writing the equations with partial derivatives or gradients is notational. We choose
to use the gradient notation because it is shorter and more consistent with the rest of the book.


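Let us look at Equation (8.3-13c) in detail to see how these equations would be implemented in a program,
and perhaps to better understand the equations. The left-hand side is a matrix. Each column of the matrix is
the partial derivative of $\tilde{z}_\xi(t_i)$ with respect to one element of $\xi$:

$$\nabla_\xi \tilde{z}_\xi(t_i) = \left[ \frac{\partial \tilde{z}_\xi(t_i)}{\partial \xi^{(1)}} \;\; \frac{\partial \tilde{z}_\xi(t_i)}{\partial \xi^{(2)}} \;\; \ldots \;\; \frac{\partial \tilde{z}_\xi(t_i)}{\partial \xi^{(p)}} \right] \tag{8.3-14}$$

The quantity $\nabla_\xi \tilde{x}_\xi(t_i)$ is a similar matrix, computed from Equation (8.3-13b); thus $C(\nabla_\xi \tilde{x}_\xi(t_i))$ is a
multiplication of a matrix times a matrix, and this is a calculation we can handle. The quantity $\nabla_\xi C$ is
the vector of matrices

$$\nabla_\xi C = \left[ \frac{\partial C}{\partial \xi^{(1)}} \;\; \frac{\partial C}{\partial \xi^{(2)}} \;\; \ldots \;\; \frac{\partial C}{\partial \xi^{(p)}} \right] \tag{8.3-15}$$

and the product $(\nabla_\xi C)\tilde{x}_\xi(t_i)$ is

$$(\nabla_\xi C)\tilde{x}_\xi(t_i) = \left[ \frac{\partial C}{\partial \xi^{(1)}} \tilde{x}_\xi(t_i) \;\; \frac{\partial C}{\partial \xi^{(2)}} \tilde{x}_\xi(t_i) \;\; \ldots \;\; \frac{\partial C}{\partial \xi^{(p)}} \tilde{x}_\xi(t_i) \right] \tag{8.3-16}$$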

(Our notation does not indicate explicitly that this is the intended product formula, but the other conceivable
interpretation of the notation is obviously wrong because the dimensions are incompatible. Formal tensor
notation would make the intention explicit, but we do not really need to introduce tensor notation here because
the correct interpretation is obvious.)

In many cases the matrix $\partial C/\partial\xi^{(j)}$ will be sparse. Typically these matrices are either zero or have only
one nonzero element. We can take advantage of such sparseness in the computation. If $C$ is not a function of
$\xi^{(j)}$ (presumably $\xi^{(j)}$ affects other of the system matrices), then $\partial C/\partial\xi^{(j)}$ is a zero matrix. If only the
$(k,m)$ element of $C$ is affected by $\xi^{(j)}$, then $[\partial C/\partial\xi^{(j)}]\tilde{x}_\xi(t_i)$ is a vector with
$[\partial C^{(k,m)}/\partial\xi^{(j)}]\tilde{x}_\xi(t_i)^{(m)}$ in the kth element and zeros elsewhere. If more than one element of $C$ is
affected by $\xi^{(j)}$, then the result is a sum of such terms. This approach directly forms $[\partial C/\partial\xi^{(j)}]\tilde{x}_\xi(t_i)$,
taking advantage of sparseness, instead of forming the full $\partial C/\partial\xi^{(j)}$ matrix and using a general-purpose
matrix multiply routine. The terms $(\nabla_\xi D)u(t_i)$, $(\nabla_\xi \Phi)\tilde{x}_\xi(t_i)$, and $(\nabla_\xi \Psi)u(t_i)$ are all similar in form to
$(\nabla_\xi C)\tilde{x}_\xi(t_i)$. The initial condition $\nabla_\xi x_0$ is a zero matrix if $x_0$ is known; otherwise it has a nonzero
element for each unknown element of $x_0$.

We now know how to evaluate all of the terms in Equation (8.3-13). This is significantly faster than
finite differences for some applications. The speed-up is most significant if $\Phi$, $\Psi$, $C$, and $D$ are functions
of time requiring significant work to evaluate at each point; straightforward finite-difference methods would
have to reevaluate these matrices for each perturbation.
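
The recursion for the analytic response gradient can be sketched as follows for a time-invariant discrete-time linear system with known initial condition. The function name and argument layout are hypothetical, and for clarity the sketch stores each partial derivative (`dPhi[j]`, `dPsi[j]`, `dC[j]`, `dD[j]`) as a full matrix rather than exploiting sparseness; a production program would form the products directly from the few nonzero elements.

```python
import numpy as np

def response_and_gradient(Phi, Psi, C, D, dPhi, dPsi, dC, dD, u, x0):
    """Propagate the response and its gradient with respect to p parameters."""
    p = len(dPhi)
    x = x0.copy()
    dx = np.zeros((x0.size, p))            # zero because x0 is assumed known
    z_list, dz_list = [], []
    for i in range(len(u) - 1):
        dx_next = Phi @ dx                 # state-gradient recursion
        for j in range(p):
            dx_next[:, j] += dPhi[j] @ x + dPsi[j] @ u[i]
        x = Phi @ x + Psi @ u[i]           # state at the next time point
        dx = dx_next
        dz = C @ dx                        # response gradient at that point
        for j in range(p):
            dz[:, j] += dC[j] @ x + dD[j] @ u[i + 1]
        z_list.append(C @ x + D @ u[i + 1])
        dz_list.append(dz)
    return np.array(z_list), np.array(dz_list)
```

The system matrices are evaluated once per time point, regardless of the number of parameters, which is the source of the savings relative to finite differences.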

Gupta and Mehra (1974) discuss a method that is basically a modification of Equation (8.3-13) for
computing $\nabla_\xi \tilde{z}_\xi(t_i)$. Depending on the number of inputs, states, outputs, and unknown parameters, this method can
sometimes save computer time by reducing the length of the gradient vector needed for propagation in
Equation (8.3-13).

We now have everything needed to implement the basic Gauss-Newton minimization algorithm. Practical
application will typically require some kind of start-up algorithm and methods for handling cases where the
algorithm converges slowly or diverges. The Iliff-Maine code, MMLE3 (Maine and Iliff, 1980; and Maine, 1981),
incorporates several such modifications. The line-search ideas (Foster, 1983) briefly discussed at the end of
Section 2.5.2 also seem appropriate for handling convergence problems. We will not cover the details of such
practical issues here.

The discussions of singularities in Section 5.4.4 and of partitioning in Section 5.4.5 apply directly to
the problem of this chapter, so we will not repeat them.

8.4  UNKNOWN  G 

The previous discussion in this chapter has assumed that the G-matrix is known. Equations (8.1-2)
and (8.1-4) are derived based on this assumption. For unknown G, the methods of Section 5.5 apply directly.
Equation (5.5-2) substitutes for Equation (8.1-4). In the terminology of this chapter, Equation (5.5-2)
becomes

$$J(\xi) = \frac{1}{2}\sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)]^* [G(\xi)G(\xi)^*]^{-1} [z(t_i) - \tilde{z}_\xi(t_i)] + \frac{N}{2}\ln|G(\xi)G(\xi)^*| \tag{8.4-1}$$

If G is known, this reduces to Equation (8.1-4) plus a constant.

As discussed in Section 5.5, the best approach to minimizing Equation (8.4-1) is to partition the
parameter vector into a part $\xi_G$ affecting G, and a part $\xi_f$ affecting $\tilde{z}$. For each fixed G, the
Gauss-Newton equations of Section 8.3 apply to revising the estimate of $\xi_f$. For each fixed $\xi_f$, the
revised estimate of G is given by Equation (5.5-7), which becomes

$$GG^* = \frac{1}{N}\sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)][z(t_i) - \tilde{z}_\xi(t_i)]^* \tag{8.4-2}$$

in the current notation. Section 5.5 describes the axial iteration method, which alternately applies the
Gauss-Newton equations of Section 8.3 for $\xi_f$ and Equation (8.4-2) for G.
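
The alternation can be sketched as follows. This is only an illustration under stated assumptions: a hypothetical two-channel response that is linear in a single parameter, z_j(t_i) = s_j * xi_f * u(t_i) + noise, and a diagonal GG*. The function name, the sensitivities `sens`, and the data are invented for the example.

```python
def axial_iteration(u, z, sens, n_passes=10):
    """Alternate a Gauss-Newton step for xi_f with the closed-form update of a diagonal GG*.

    Assumed model (illustration only): z[j][i] = sens[j] * xi_f * u[i] + noise.
    """
    n = len(u)
    channels = range(len(z))
    gg = [1.0] * len(z)  # starting guess for the diagonal of GG*
    xi = 0.0
    for _ in range(n_passes):
        # Gauss-Newton step with GG* held fixed; one step is exact here because
        # the response is linear in xi_f.
        num = sum(sens[j] * u[i] * z[j][i] / gg[j] for j in channels for i in range(n))
        den = sum((sens[j] * u[i]) ** 2 / gg[j] for j in channels for i in range(n))
        xi = num / den
        # Update GG* with xi_f held fixed: mean-square residual of each channel,
        # the diagonal form of Equation (8.4-2).
        gg = [sum((z[j][i] - sens[j] * xi * u[i]) ** 2 for i in range(n)) / n
              for j in channels]
    return xi, gg


# Hypothetical data: true xi_f = 2, residuals of +/-0.1 and +/-0.3 on the two channels.
u = [1.0, 1.0, -1.0, -1.0]
z = [[2.1, 1.9, -1.9, -2.1], [1.3, 0.7, -0.7, -1.3]]
xi_f, gg_diag = axial_iteration(u, z, sens=[1.0, 0.5])
```

The sketch assumes every channel has a nonzero residual; a channel that fits exactly would drive its element of GG* to zero and the next weighted step would divide by zero.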

The cost function for estimation with unknown G is often written in alternate forms. Although the above
form is usually the most useful for computation, the following forms provide some insight into the relations of
the estimators with unknown G versus those with fixed G. When G is completely unknown, the minimization
of Equation (8.4-1) is equivalent to the minimization of

$$J(\xi) = \left| \sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)][z(t_i) - \tilde{z}_\xi(t_i)]^* \right| \tag{8.4-3}$$

which corresponds to Equation (5.5-9). Section 5.5 derives this equivalence by eliminating G. It is common
to restrict G to be diagonal, in which case Equation (8.4-3) becomes

$$J(\xi) = \prod_{j=1}^{\ell} \sum_{i=1}^{N} [z^{(j)}(t_i) - \tilde{z}_\xi^{(j)}(t_i)]^2$$

This form is a product of the errors in the different signals, instead of the weighted sum-of-the-errors form
of Equation (8.1-4).
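
The elimination of G can be seen in one step. The sketch below assumes that the log-determinant term of Equation (8.4-1) carries the factor N/2 (which is what the 1/N in Equation (8.4-2) implies); the shorthand $r_i$ for the residual $z(t_i) - \tilde{z}_\xi(t_i)$ and $\ell$ for the number of outputs are introduced here for convenience.

```latex
\begin{align*}
\sum_{i=1}^{N} r_i^* (GG^*)^{-1} r_i
  &= \operatorname{tr}\Big[(GG^*)^{-1} \sum_{i=1}^{N} r_i r_i^*\Big]
   = \operatorname{tr}\big[(GG^*)^{-1}\, N\, GG^*\big] = N\ell, \\
J(\xi) &= \tfrac{1}{2} N \ell
   + \tfrac{1}{2} N \ln\Big|\tfrac{1}{N}\sum_{i=1}^{N} r_i r_i^*\Big| .
\end{align*}
```

The first line substitutes the estimate of Equation (8.4-2) for GG*, so the quadratic term collapses to a constant. Because the logarithm is monotonic, what remains is minimized by the same parameter value that minimizes the determinant of the summed residual products. With G restricted to be diagonal, the same argument applies signal by signal, and the determinant becomes a product of per-signal sums of squares.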
8.5 CHARACTERISTICS


We have shown that the output error estimator is a direct application of the estimators derived in
Section 5.4 for nonlinear static systems. To describe the statistical characteristics of output error
estimates, we need only apply the corresponding Section 5.4 results to the particular form of output error.

In most cases, the corresponding static system is nonlinear, even for linear dynamic systems. Therefore,
we must use the forms of Section 5.4 instead of the simpler forms of Section 5.1, which apply to linear static
systems. In particular, the output error MLE and MAP estimators are both biased for finite time.
Asymptotically, they are unbiased and efficient.


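From Equation (5.4-11), the covariance of the MLE output error estimate is approximated by

$$\operatorname{cov}(\hat{\xi}\,|\,\xi) \approx \left\{ \sum_{i=1}^{N} [\nabla_\xi \tilde{z}_\xi(t_i)]^* (GG^*)^{-1} [\nabla_\xi \tilde{z}_\xi(t_i)] \right\}^{-1} \tag{8.5-1}$$

From Equation (5.4-12), the corresponding approximation for the posterior distribution of $\xi$ in an MAP
estimator is

$$\operatorname{cov}(\xi\,|\,Z) \approx \left\{ \sum_{i=1}^{N} [\nabla_\xi \tilde{z}_\xi(t_i)]^* (GG^*)^{-1} [\nabla_\xi \tilde{z}_\xi(t_i)] + P^{-1} \right\}^{-1} \tag{8.5-2}$$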
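
For a single unknown parameter and a diagonal GG*, the two covariance approximations reduce to scalar sums, which makes their relationship easy to see. The sketch below is an illustration under those assumptions only; the function names, the layout of the sensitivity table, and the numbers are hypothetical.

```python
def information_sum(sens, gg):
    """Sum of squared sensitivities weighted by the inverse measurement-noise variances.

    sens[i][j] is the sensitivity of output j at time t_i to the parameter;
    gg[j] is the j-th diagonal element of GG*.
    """
    return sum(s ** 2 / gg[j] for row in sens for j, s in enumerate(row))


def mle_covariance(sens, gg):
    """Scalar-parameter form of Equation (8.5-1)."""
    return 1.0 / information_sum(sens, gg)


def map_covariance(sens, gg, prior_var):
    """Scalar-parameter form of Equation (8.5-2): the prior adds 1/P to the sum."""
    return 1.0 / (information_sum(sens, gg) + 1.0 / prior_var)


# Hypothetical sensitivities for two time points and two outputs.
sens = [[1.0, 2.0], [1.0, 2.0]]
gg = [1.0, 4.0]
cov_mle = mle_covariance(sens, gg)
cov_map = map_covariance(sens, gg, prior_var=1.0)
```

The MAP value is never larger than the MLE value, since the prior term can only add to the sum being inverted.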
CHAPTER 9


9.0 FILTER ERROR METHOD FOR DYNAMIC SYSTEMS

In this chapter, we consider the parameter estimation problem for dynamic systems with both process and
measurement noise. We restrict the consideration to linear systems with additive Gaussian noise, because the
exact analysis of more general systems is impractically complicated except in special cases like output error
(no process noise).

The easiest way to handle nonlinear systems with both measurement and process noise is usually to
linearize the system and apply the linear results. This method does not give exact results for nonlinear
systems, but can give adequate approximations in some cases.

In mixed continuous/discrete time, the linear system model is

$$x(t_0) = x_0 \tag{9.0-1a}$$

$$\dot{x}(t) = Ax(t) + Bu(t) + Fn(t) \tag{9.0-1b}$$

$$z(t_i) = Cx(t_i) + Du(t_i) + G\eta_i \qquad i = 1, 2, \ldots \tag{9.0-1c}$$

The measurement noise $\eta$ is assumed to be a sequence of independent Gaussian random variables with zero mean
and identity covariance. The process noise $n$ is a zero-mean, white-noise process, independent of the
measurement noise, with identity spectral density. The initial condition $x_0$ is assumed to be a Gaussian
random variable, independent of $n$ and $\eta$, with mean $\bar{x}_0$ and covariance $P_0$. As special cases, $P_0$ can
be 0, implying that the initial condition is known exactly; or infinite, implying complete ignorance of the
initial condition. The input $u$ is assumed to be known exactly.

As in the case of output error, the system matrices A, B, C, D, F, and G are functions of $\xi$ and may
be functions of time.

The corresponding pure discrete-time model is

$$x(t_0) = x_0 \tag{9.0-2a}$$

$$x(t_{i+1}) = \Phi x(t_i) + \Psi u(t_i) + Fn_i \qquad i = 0, 1, \ldots \tag{9.0-2b}$$

$$z(t_i) = Cx(t_i) + Du(t_i) + G\eta_i \qquad i = 1, 2, \ldots \tag{9.0-2c}$$

All of the same assumptions apply, except that $n$ is a sequence of independent Gaussian random variables with
zero mean and identity covariance.


9.1  DERIVATION 

In order to obtain the maximum likelihood estimate of $\xi$, we need to choose $\xi$ to maximize
$L(\xi, Z_N) = p(Z_N|\xi)$ where

$$Z_N = [z(t_1), z(t_2), \ldots, z(t_N)]^* \tag{9.1-1}$$

For the MAP estimate, we need to maximize $p(Z_N|\xi)p(\xi)$. In either event, the crucial first step is to find a
tractable expression for $p(Z_N|\xi)$. We will discuss three ways of deriving this density function.

9.1.1  Static  Derivation 

The first means of deriving an expression for $p(Z_N|\xi)$ is to solve the system equations, reducing them to
the static form of Equation (5.0-1). This technique, although simple in principle, does not give a tractable
solution. We briefly outline the approach here in order to illustrate the principle, before considering the
more fruitful approaches of the following sections.

For a pure discrete-time linear system described by Equation (9.0-2), the explicit static expression for
$z(t_i)$ is

$$z(t_i) = C\Phi^i x(t_0) + C\sum_{j=0}^{i-1} \Phi^{i-j-1}\,(\Psi u(t_j) + Fn_j) + Du(t_i) + G\eta_i \tag{9.1-2}$$

This is a nonlinear static model in the general form of Equation (5.5-1). However, the separation of $\xi$
into $\xi_G$ and $\xi_f$ as described by Equation (5.5-4) does not apply. Note that Equation (9.1-2) is a nonlinear
function of $\xi$, even if the matrices are linear functions. In fact, the order of nonlinearity increases with
the number of time points. The use of estimators derived directly from Equation (9.1-2) is unacceptably
difficult for all but the simplest special cases, and we will not pursue it further.

For mixed continuous/discrete-time systems, similar principles apply, except that the $\omega$ of
Equation (5.0-1) must be generalized to allow vectors of infinite dimension. The process noise in a mixed
continuous/discrete-time system is a function of time, and cannot be written as a finite-dimensional random
vector. The material of Chapter 5 covered only finite-dimensional vectors. The Chapter 5 results generalize
nicely to infinite-dimensional vector spaces (function spaces), but we will not find that level of abstraction
necessary. Application to pure continuous-time systems would require further generalization to allow
infinite-dimensional observations.

9.1.2 Derivation by Recursive Factoring

We will now consider a derivation based on factoring $p(Z_N|\xi)$ by means of Bayes rule (Equation (3.3-12)).
The derivation applies either to pure discrete-time or mixed continuous/discrete-time systems; the derivation
is identical in both cases. For the first step, write

$$p(Z_N|\xi) = p(z(t_N)\,|\,Z_{N-1},\xi)\; p(Z_{N-1}|\xi) \tag{9.1-3}$$

Recursive application of this formula gives

$$p(Z_N|\xi) = \prod_{i=1}^{N} p(z(t_i)\,|\,Z_{i-1},\xi) \tag{9.1-4}$$

For any particular $\xi$, the distribution of $z(t_i)$ given $Z_{i-1}$ is known from the Chapter 7 results; it is
Gaussian with mean

$$\begin{aligned}
\tilde{z}_\xi(t_i) &\equiv E\{z(t_i)\,|\,Z_{i-1},\xi\} \\
 &= E\{Cx(t_i) + Du(t_i) + G\eta_i\,|\,Z_{i-1},\xi\} \\
 &= C\tilde{x}_\xi(t_i) + Du(t_i)
\end{aligned} \tag{9.1-5}$$

and covariance

$$\begin{aligned}
R_i &\equiv \operatorname{cov}\{z(t_i)\,|\,Z_{i-1},\xi\} \\
 &= \operatorname{cov}\{Cx(t_i) + Du(t_i) + G\eta_i\,|\,Z_{i-1},\xi\} \\
 &= CQ_iC^* + GG^*
\end{aligned} \tag{9.1-6}$$

Note that $\tilde{z}_\xi(t_i)$ and $\tilde{x}_\xi(t_i)$ are functions of $\xi$ because they are obtained from the Kalman filter
based on a particular value of $\xi$; that is, they are conditioned on $\xi$. We use the $\xi$ subscript notation to
emphasize this dependence. $R_i$ is also a function of $\xi$, although our notation does not explicitly indicate this.

Substituting the appropriate Gaussian density functions characterized by Equations (9.1-5) and (9.1-6)
into Equation (9.1-4) gives

$$L(\xi, Z_N) = p(Z_N|\xi) = \prod_{i=1}^{N} |2\pi R_i|^{-1/2} \exp\left\{-\tfrac{1}{2}[z(t_i) - \tilde{z}_\xi(t_i)]^* R_i^{-1} [z(t_i) - \tilde{z}_\xi(t_i)]\right\} \tag{9.1-7}$$

This is the desired expression for the likelihood functional.

9.1.3  Derivation  Using  the  Innovation 

Another derivation involves the properties of the innovation. This derivation also applies either to
mixed continuous/discrete-time or to pure discrete-time systems.

We proved in Chapter 7 that the innovations are a sequence of independent, zero-mean Gaussian variables
with covariances $R_i$ given by Equation (7.2-33). This proof was done for the pure discrete-time case, but
extends directly to mixed continuous/discrete-time systems. The Chapter 7 results assumed that the system
matrices were known; thus the results are conditioned on $\xi$. The conditional probability density function of
the innovations is therefore

$$p(V_N|\xi) = \prod_{i=1}^{N} |2\pi R_i|^{-1/2} \exp\left(-\tfrac{1}{2}\,\nu_i^* R_i^{-1} \nu_i\right) \tag{9.1-8}$$

We also showed in Chapter 7 that the innovations are an invertible linear function of the observations.
Furthermore, it is easy to show that the determinant of the Jacobian of the transformation equals 1. (The
Jacobian is triangular with 1's on the diagonal.) Thus by Equation (3.4-1), we can substitute

$$\nu_i = z(t_i) - \tilde{z}_\xi(t_i) \tag{9.1-9}$$

into Equation (9.1-8) to give

$$p(Z_N|\xi) = \prod_{i=1}^{N} |2\pi R_i|^{-1/2} \exp\left\{-\tfrac{1}{2}[z(t_i) - \tilde{z}_\xi(t_i)]^* R_i^{-1} [z(t_i) - \tilde{z}_\xi(t_i)]\right\} \tag{9.1-10}$$

which is identical to Equation (9.1-7). We see that the derivation by Bayes factoring and the derivation
using the innovation give the same result.
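
For a scalar system the likelihood can be evaluated directly by running the Kalman filter and accumulating the innovation terms. The sketch below computes the negative logarithm of the likelihood for the pure discrete-time model of Equation (9.0-2). It is an illustration under stated assumptions, not the implementation of any existing program: all quantities are scalars, the function name and argument order are invented here, and the innovation variance is assumed to stay positive.

```python
import math


def filter_error_cost(phi, psi, c, d, f, g, x0, p0, u, z):
    """Negative log likelihood of a scalar discrete-time system via the Kalman filter.

    u[i] is the input at t_i for i = 0..N; z[i - 1] is the measurement at t_i for i = 1..N.
    """
    x_hat, p = x0, p0  # filtered state estimate and its covariance at t_0
    cost = 0.0
    for i in range(1, len(z) + 1):
        # Prediction step from t_{i-1} to t_i.
        x_pred = phi * x_hat + psi * u[i - 1]
        q = phi * p * phi + f * f
        # Innovation and its covariance, R_i = C Q_i C* + GG*.
        nu = z[i - 1] - (c * x_pred + d * u[i])
        r = c * q * c + g * g
        cost += 0.5 * nu * nu / r + 0.5 * math.log(2.0 * math.pi * r)
        # Measurement update.
        k = q * c / r
        x_hat = x_pred + k * nu
        p = (1.0 - k * c) * q
    return cost
```

A parameter search would call this function repeatedly with the system coefficients computed from trial values of the unknown parameters. Note that only filter quantities (predictions and innovations) enter the sum.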


For many applications, we can use the steady-state Kalman filter in the cost functional, resulting in
major computational savings. This usage requires, of course, that the steady-state filter exist. We discussed
the criteria for the existence of the steady-state filter in Chapter 7. The most important criterion is
obviously that the system be time-invariant. The rest of this section assumes that a steady-state form exists.
When a steady-state form exists, two approaches can be taken to justifying its use.

The first justification is that the steady-state form is a good approximation if the time interval is long
enough. The time-varying filter gain converges to the steady-state gain with time constants at least as fast
as those of the open-loop system, and sometimes significantly faster. Thus, if the maneuver analyzed is long
compared to the system time constants, the filter gain would converge to the steady-state gain in a small
portion of the maneuver time. We could verify this behavior by computing time-varying gains for representative
values of $\xi$. If the filter gain does converge quickly to the steady-state gain, then the steady-state filter
should give a good approximation to the cost functional.

The second possible justification for the use of the steady-state filter involves the choice of the
initial state covariance $P_0$. The time-varying filter requires $P_0$ to be specified. It is a common practice
to set $P_0$ to zero. This practice arises more from a lack of better ideas than from any real argument that
zero is a good value. It is seldom that we know the initial state exactly as implied by the zero covariance.
One circumstance which would justify the zero initial covariance would be the case where the initial condition
is included in the list of unknown parameters. In this case, the initial covariance is properly zero because
the filter is conditioned on the values of the unknown parameters. Any prior information about the initial
condition is then reflected in the prior distribution of $\xi$ instead of in $P_0$. Unless one has a specific need
for estimates of the initial condition, there are usually better approaches.

We suggest that the steady-state covariance is often a reasonable value for the initial covariance. In
this case, the time-varying and steady-state filters are identical; arguments about the speed of convergence
and the length of the data interval are not required. Since the time-varying form requires significantly more
computation than the steady-state form, the steady-state form is preferable except where it is clearly and
significantly inferior.

If the steady-state filter is used, Equation (9.1-7) becomes

$$L(\xi, Z_N) = \prod_{i=1}^{N} |2\pi R|^{-1/2} \exp\left\{-\tfrac{1}{2}[z(t_i) - \tilde{z}_\xi(t_i)]^* R^{-1} [z(t_i) - \tilde{z}_\xi(t_i)]\right\} \tag{9.1-11}$$

where $R$ is the steady-state covariance of the innovation. In general, $R$ is a function of $\xi$. The $\tilde{z}_\xi(t_i)$
in Equation (9.1-11) comes from the steady-state filter, unlike the $\tilde{z}_\xi(t_i)$ in Equation (9.1-7). We use the
same notation for both quantities, distinguishing them by context. (The $\tilde{z}_\xi(t_i)$ from the steady-state filter
is always associated with the steady-state covariance $R$, whereas the $\tilde{z}_\xi(t_i)$ from the time-varying filter is
associated with the time-varying covariance $R_i$.)
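
One simple way to obtain the steady-state innovation covariance for a scalar time-invariant system is to iterate the covariance recursion until it stops changing. The sketch below does this by plain fixed-point iteration, which is an assumption of convenience rather than a recommended solver; the function name and the fixed iteration count are hypothetical, and the innovation variance is assumed to remain positive.

```python
def steady_state_innovation_cov(phi, c, f, g, n_iter=500):
    """Iterate the scalar prediction-covariance recursion; return (Q, R) at the end."""
    q = f * f  # prediction covariance after one step from a perfectly known state
    for _ in range(n_iter):
        r = c * q * c + g * g        # innovation covariance
        p = q - q * c * c * q / r    # filtered covariance after the measurement update
        q = phi * p * phi + f * f    # prediction covariance at the next time point
    return q, c * q * c + g * g


# With no process noise (f = 0) the prediction covariance stays at zero and R equals GG*.
q_oe, r_oe = steady_state_innovation_cov(phi=0.5, c=1.0, f=0.0, g=2.0)

# With process noise, Q settles at the fixed point of the recursion.
q_ss, r_ss = steady_state_innovation_cov(phi=0.9, c=1.0, f=1.0, g=1.0)
```

Starting the time-varying filter from the converged prediction covariance reproduces the steady-state filter at every time point, which is the situation described in the text where the two filters are identical.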

9.1.5  Cost  Function  Discussion 

The maximum-likelihood estimate of $\xi$ is obtained by maximizing Equation (9.1-11) (or Equation (9.1-7)
if the steady-state form is inappropriate) with respect to $\xi$.

Because of the exponential in Equation (9.1-11), it is more convenient to work with the logarithm of the
likelihood functional, called the log likelihood functional for short. The log likelihood functional is
maximized by the same value of $\xi$ that maximizes the likelihood functional because the logarithm is a
monotonic increasing function. By convention, most optimization theory is written in terms of minimization
instead of maximization. We therefore define the negative of the log likelihood functional to be a cost
functional which is to be minimized. We also omit the $\ln(2\pi)$ term from the cost functional, because it does
not affect the minimization. The most convenient expression for the cost functional is then

$$J(\xi) = \frac{1}{2}\sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)]^* R^{-1} [z(t_i) - \tilde{z}_\xi(t_i)] + \frac{1}{2} N \ln|R| \tag{9.1-12}$$

If $R$ is known, then Equation (9.1-12) is in a least-squares form. This is sometimes called a
prediction-error form because the quantity being minimized is the square of the one-step-ahead prediction error
$z(t_i) - \tilde{z}_\xi(t_i)$. The term "filter error" is also used because the quantity minimized is obtained from the
Kalman filter.

Note that this form of the likelihood functional involves the Kalman filter, not a smoother. There is
sometimes a temptation to replace the filter in this cost function by a smoother, assuming that this will give
improved results. The smoother gives better state estimates than the filter, but the problem considered in
this chapter is not state estimation. The state estimates are an incidental side-product of the algorithm for
estimating the parameter vector $\xi$. There are ways of deriving and writing the parameter estimation problem
which involve smoothers (Cox and Bryson, 1980), but the direct use of a smoother in Equation (9.1-12) is
simply incorrect.

For MAP estimates, we modify the cost functional by adding the negative of the logarithm of the prior
probability density of $\xi$. If the prior distribution of $\xi$ is Gaussian with mean $m_\xi$ and covariance $W$, the
cost functional of Equation (9.1-12) becomes (ignoring constant terms)

$$J(\xi) = \frac{1}{2}\sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)]^* R^{-1} [z(t_i) - \tilde{z}_\xi(t_i)] + \frac{1}{2} N \ln|R| + \frac{1}{2}(\xi - m_\xi)^* W^{-1} (\xi - m_\xi) \tag{9.1-13}$$

The filter-error forms of Equations (9.1-12) and (9.1-13) are parallel to the output-error forms of
Equations (8.1-4) and (8.1-2). When there is no process noise, the steady-state Kalman filter becomes an
integration of the system equations, and the innovation covariance $R$ equals the measurement noise covariance
$GG^*$. Thus the output error equations of the previous chapter are special cases of the filter error equations
with zero process noise.


9.2  COMPUTATION 

The best methods for minimizing Equation (9.1-12) or (9.1-13) are based on the Gauss-Newton algorithm.
Because these equations are so similar in form to the output-error equations of Chapter 8, most of the
Chapter 8 material on computation applies directly or with only minor modification.

The primary differences between computational methods for filter error and those for output error center
on the treatment of the noise covariances, particularly when the covariances are unknown. Maine and Iliff
(1981a) discuss the implementation details of the filter-error algorithm. The Iliff-Maine code, MMLE3 (Maine
and Iliff, 1980; and Maine, 1981), implements the filter-error algorithm for linear continuous/discrete-time
systems.

We generally presume the use of the steady-state filter in the filter-error algorithm. Implementation is
significantly more complicated using the time-varying filter.


9.3  FORMULATION  AS  A  FILTERING  PROBLEM 

An alternative to the direct approach of the previous section is to recast the parameter estimation
problem into the form of a filtering problem. The techniques of Chapter 7 then apply.

Suppose we start with the system model

$$x(t_0) = x_0 \tag{9.3-1a}$$

$$\dot{x}(t) = A(\xi)x(t) + B(\xi)u(t) + Fn(t) \tag{9.3-1b}$$

$$z(t_i) = C(\xi)x(t_i) + D(\xi)u(t_i) + G\eta_i \tag{9.3-1c}$$

This is the same as Equation (9.0-1), except that here we explicitly indicate the dependence of the matrices
on $\xi$. The problem is to estimate $\xi$.

In order to apply state estimation techniques to this problem, $\xi$ must be part of the state vector.
Therefore, we define an augmented state vector

$$x_a = \begin{bmatrix} x \\ \xi \end{bmatrix} \tag{9.3-2}$$

We can combine Equation (9.3-1) with the trivial differential equation

$$\dot{\xi} = 0 \tag{9.3-3}$$

to write a system equation with $x_a$ as the state vector. Note that the resulting system is nonlinear in $x_a$
(because it has products of $\xi$ and $x$), even though Equation (9.3-1) is linear in $x$.
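
Written out, the combined system takes the following form. This is a sketch of the stacking only, taking the augmented state to be $x$ stacked on $\xi$; the block layout is the natural one but is spelled out here as an illustration rather than quoted from the source.

```latex
\begin{align*}
\dot{x}_a(t) &=
  \begin{bmatrix} A(\xi)\,x(t) + B(\xi)\,u(t) \\ 0 \end{bmatrix}
  + \begin{bmatrix} F \\ 0 \end{bmatrix} n(t), \\
z(t_i) &= C(\xi)\,x(t_i) + D(\xi)\,u(t_i) + G\eta_i .
\end{align*}
```

The products $A(\xi)x$ and $C(\xi)x$ multiply one part of the augmented state by another, which is the source of the nonlinearity: even if $A$ and $C$ are linear in $\xi$, the right-hand sides are bilinear in $x_a$.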

In principle, we can apply the extended Kalman filter, discussed in Section 7.7, to the problem of
estimating $x_a$. Unfortunately, the nonlinearity in the augmented system is crucial to the system behavior. The
adequacy of the extended Kalman filter for this problem has seldom been analyzed in detail. Schweppe (1973,
p. 433) says on this subject

...the system identification problem has been transformed into a problem
which has already been discussed extensively.

The discussions are not terminated at this point for the simple reason that
Part IV did not provide any "best" one way to solve a nonlinear state
estimation problem. A major conclusion of Part IV was that the best way to
proceed depends heavily on the explicit nature of the problem. System
identification leads to special types of nonlinear estimation problems, so
specialized discussions are needed.

...the state augmentation approach is not emphasized, as the author feels
that it is much more appropriate to approach the system identification
problem directly. However, there are special cases where state augmentation
works very well.


CHAPTER 10


10.0  EQUATION  ERROR  METHOD  FOR  DYNAMIC  SYSTEMS 

This chapter discusses the equation error approach to parameter estimation for dynamic systems. We will
first define a restricted form of equation error, parallel to the treatments of output error and filter error
in the previous chapters. This form of equation error is a special case of filter error where there is process
noise, but no measurement noise. It therefore stands in counterpoint to output error, which is the special
case where there is measurement noise, but no process noise.

We will then extend the definition of equation error to a more general form. Some of the practical
applications of equation error do not fit precisely into the overly restrictive form based on process noise only.

In its most general forms, the term equation error encompasses output error and filter error, in addition to
the forms most commonly associated with the term. The primary distinguishing feature of the methods emphasized
in this chapter is their computational simplicity.


10.1  PROCESS-NOISE  APPROACH 

In this section, we consider equation error in a manner parallel to the previous treatments of output
error and filter error. The filter-error method treats systems with both process noise and measurement noise,
and output error treats the special case of systems with measurement noise only. Equation error completes this
triad of algorithms by treating the special case of systems with process noise only.

The equation-error method applies to nonlinear systems with additive Gaussian process noise. We will
restrict the discussion of this section to pure discrete-time models, for which the derivation is
straightforward. Mixed continuous/discrete-time models can be handled by converting them to equivalent pure
discrete-time models. Equation error does not strictly apply to pure continuous-time models. (The problem
becomes ill-posed.)

The general form of the nonlinear, discrete-time system model we will consider is

$$x(t_0) = x_0 \tag{10.1-1a}$$

$$x(t_{i+1}) = f[x(t_i), u(t_i), \xi] + Fn_i \qquad i = 0, 1, \ldots, N-1 \tag{10.1-1b}$$

$$z(t_i) = g[x(t_i), u(t_i), \xi] \qquad i = 0, 1, \ldots, N \tag{10.1-1c}$$

The process noise, $n$, is a sequence of independent Gaussian random variables with zero mean and identity
covariance. The matrix $F$ can be a function of $\xi$, although the simplified notation ignores this possibility.

It will prove convenient to assume that the measurements $z(t_i)$ are defined for $i = 0, \ldots, N$; previous
chapters have defined them only for $i = 1, \ldots, N$.

10.1.1  Derivation 

The following derivation of the equation-error method closely parallels the derivation of the filter-error
method in Section 9.1.3. Both are based primarily on application of the transformation of variables formula,
Equation (3.4-1), starting from a process known to be a sequence of independent Gaussian variables.

By assumption, the probability density function of the process noise is

$$p(n_N) = \prod_{i=0}^{N-1} |2\pi I|^{-1/2} \exp\left(-\tfrac{1}{2}\, n_i^* n_i\right) \tag{10.1-2}$$

where $n_N$ is the concatenation of the $n_i$. We further assume that $F$ is invertible for all permissible
values of $\xi$; this assumption is necessary to ensure that the problem is well-posed. We define $X_N$ to be the
concatenation of the $x(t_i)$. Then, for each value of $\xi$, $X_N$ is an invertible linear function of $n_N$. The
inverse function is

$$n_i = F^{-1}[x(t_{i+1}) - \tilde{x}_\xi(t_{i+1})] \tag{10.1-3}$$

where, for convenience and for consistency with the notation of previous chapters, we have defined

$$\tilde{x}_\xi(t_{i+1}) = f[x(t_i), u(t_i), \xi] \tag{10.1-4}$$

The determinant of the Jacobian of the inverse transformation is $|F|^{-N}$ because the inverse transformation
matrix is block-triangular with $F^{-1}$ in the diagonal blocks. Direct application of the
transformation-of-variables formula, Equation (3.4-1), gives

$$p(X_N|\xi) = \prod_{i=1}^{N} |2\pi FF^*|^{-1/2} \exp\left\{-\tfrac{1}{2}[x(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [x(t_i) - \tilde{x}_\xi(t_i)]\right\} \tag{10.1-5}$$


In order to derive a simple expression for $p(Z_N|\xi)$, we require that $g$ be a continuous, invertible
function of $x$ for each value of $\xi$. The invertibility is critical to the simplicity of the equation-error
algorithm. This assumption, combined with the lack of measurement noise, means that we can reconstruct the
state vector perfectly, provided that we know $\xi$. The inverse function gives this reconstruction:

$$\hat{x}_\xi(t_i) = g^{-1}[z(t_i), u(t_i), \xi] \tag{10.1-6}$$

If $g$ is not invertible, a recursive state estimator becomes imbedded in the algorithm and we are again faced
with something as complicated as the filter-error algorithm. For invertible $g$, the
transformation-of-variables formula, Equation (3.4-1), gives

$$p(Z_N|\xi) = \prod_{i=1}^{N} \left|\det \frac{\partial g^{-1}[z(t_i), u(t_i), \xi]}{\partial z(t_i)}\right| \, |2\pi FF^*|^{-1/2} \exp\left\{-\tfrac{1}{2}[\hat{x}_\xi(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [\hat{x}_\xi(t_i) - \tilde{x}_\xi(t_i)]\right\} \tag{10.1-7}$$

where $\hat{x}_\xi(t_i)$ is given by Equation (10.1-6), and

$$\tilde{x}_\xi(t_i) = f[\hat{x}_\xi(t_{i-1}), u(t_{i-1}), \xi] \tag{10.1-8}$$

Most practical applications of equation error separate the problems of state reconstruction and parameter
estimation. In the context defined above, this is possible when $g$ is not a function of $\xi$. Then
Equation (10.1-6) is also independent of $\xi$; thus, we can reconstruct the state exactly without knowledge of $\xi$.
Furthermore, the estimates of $\xi$ depend only on the reconstructed state vector and the control vector. There
is no direct dependence on the actual measurements $z(t_i)$ or on the exact form of the $g$-function. This is
evident in Equation (10.1-7) because the Jacobian of $g^{-1}$ is independent of $\xi$ and, therefore, irrelevant to
the parameter-estimation problem. In many practical applications, the state reconstruction is more complicated
than a simple pointwise function as in Equation (10.1-6), but as long as the state reconstruction does not
depend on $\xi$, the details do not matter to the parameter-estimation process.

You will seldom (if ever) see Equation (10.1-7) elsewhere in the form shown here, which includes the
factor for the Jacobian of $g^{-1}$. The usual derivation ignores the measurement equation and starts from the
assumption that the state is known exactly, whether by direct measurement or by some reconstruction. We have
included the measurement equation only in order to emphasize the parallels between equation error, output
error, and filter error. For the rest of this section, we will assume that $g$ is independent of $\xi$. We will
specifically assume that the determinant of the Jacobian of $g$ is 1 (the actual value being irrelevant to the
estimator anyway), so that we can write Equation (10.1-7) in a more conventional form as

$$p(Z_N|\xi) = \prod_{i=1}^{N} |2\pi FF^*|^{-1/2} \exp\left\{-\tfrac{1}{2}[x(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [x(t_i) - \tilde{x}_\xi(t_i)]\right\} \tag{10.1-9}$$

where

$$\tilde{x}_\xi(t_i) = f[x(t_{i-1}), u(t_{i-1}), \xi] \tag{10.1-10}$$

You can derive slight generalizations, useful in some cases, from Equation (10.1-7).

The maximum-likelihood estimate of $\xi$ is the value that maximizes Equation (10.1-9). As in previous
chapters, it is convenient to work in terms of minimizing the negative-log-likelihood functional

$$J(\xi) = \frac{1}{2}\sum_{i=1}^{N} [x(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [x(t_i) - \tilde{x}_\xi(t_i)] + \frac{1}{2} N \ln|FF^*| \tag{10.1-11}$$

If $\xi$ has a Gaussian prior distribution with mean $m_\xi$ and covariance $P$, then the MAP estimate minimizes

$$J(\xi) = \frac{1}{2}\sum_{i=1}^{N} [x(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [x(t_i) - \tilde{x}_\xi(t_i)] + \frac{1}{2} N \ln|FF^*| + \frac{1}{2}(\xi - m_\xi)^* P^{-1} (\xi - m_\xi) \tag{10.1-12}$$
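
When the state equation is linear in the unknown parameters and the state is available directly, minimizing the negative-log-likelihood functional reduces to ordinary least squares. The sketch below illustrates this for a hypothetical scalar system x(t_{i+1}) = a x(t_i) + b u(t_i) + F n_i with unknown a and b; the function name and the data are invented for the example. The mean-square residual returned as an estimate of FF* follows from the same argument that gives Equation (8.4-2) and is included here as an assumption, not as a result quoted from this section.

```python
def equation_error_fit(x, u):
    """Least-squares estimates of (a, b) in x[i+1] = a*x[i] + b*u[i] + F*n[i].

    Also returns the mean-square equation error as an estimate of FF*.
    """
    n = len(x) - 1
    sxx = sum(x[i] * x[i] for i in range(n))
    sxu = sum(x[i] * u[i] for i in range(n))
    suu = sum(u[i] * u[i] for i in range(n))
    sxy = sum(x[i] * x[i + 1] for i in range(n))
    suy = sum(u[i] * x[i + 1] for i in range(n))
    # Solve the 2-by-2 normal equations by Cramer's rule.
    det = sxx * suu - sxu * sxu
    a = (sxy * suu - suy * sxu) / det
    b = (suy * sxx - sxy * sxu) / det
    ff = sum((x[i + 1] - a * x[i] - b * u[i]) ** 2 for i in range(n)) / n
    return a, b, ff


# Hypothetical noise-free data generated with a = 0.5 and b = 1.0.
u = [1.0, 0.0, -1.0, 2.0]
x = [1.0]
for u_i in u:
    x.append(0.5 * x[-1] + 1.0 * u_i)
a_hat, b_hat, ff_hat = equation_error_fit(x, u)
```

No iteration and no state propagation are needed, which is the computational simplicity that distinguishes equation error from output error and filter error.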


10.1.2  Special  Case  of  Filter  Error 


For linear systems, we can also derive state-equation error by plugging into the linear filter-error
algorithm derived in Chapter 9. Assume that $G$ is 0; $FF^*$ is invertible; $C$ is square, invertible, and known
exactly; and $D$ is known exactly. These are the assumptions that mean we have perfect measurements of the
state of the system.

The Kalman filter for this case is (repeating Equation (7.3-11))

$$\hat{x}(t_i) = C^{-1}[z(t_i) - Du(t_i)] = x(t_i) \tag{10.1-13}$$

and the covariance, $P_i$, of this filtered estimate is 0. The one-step-ahead prediction is

$$\tilde{x}(t_{i+1}) = \Phi \hat{x}(t_i) + \Psi u(t_i) \tag{10.1-14}$$

$$Q_{i+1} = FF^* \tag{10.1-15}$$

From Equation (9.1-6) we have

$$R_i = CFF^*C^* \tag{10.1-16}$$

and thus Equation (9.1-12) becomes

$$J(\xi) = \frac{1}{2}\sum_{i=1}^{N} [z(t_i) - \tilde{z}_\xi(t_i)]^* (CFF^*C^*)^{-1} [z(t_i) - \tilde{z}_\xi(t_i)] + \frac{1}{2} N \ln|CFF^*C^*| \tag{10.1-17}$$

Eliminating irrelevant $|C|$ constants, we can redefine the cost function as

$$J(\xi) = \frac{1}{2}\sum_{i=1}^{N} [x(t_i) - \tilde{x}_\xi(t_i)]^* (FF^*)^{-1} [x(t_i) - \tilde{x}_\xi(t_i)] + \frac{1}{2} N \ln|FF^*| \tag{10.1-18}$$

which is in the form of Equation (10.1-11). Note that $C$ and $D$ play no role in this estimator, outside of the
reconstruction of the state using Equation (10.1-13).

10.1.3  Discussion 

The cost function defined by Equation (10.1-11) or (10.1-1?) involves a weighted square sum of the error
that would be in the state equation, Equation (10.1-1b), if the noise term were omitted. The term "equation
error" derives from this fact. This terminology is rather vague, giving little hint as to what equation is
meant. The output-error and filter-error methods described in previous chapters could, with equal validity, be
categorized as methods involving minimizing the error of some equation. In spite of this potential ambiguity,
the use of the term "equation error" is well-established, and the term is unlikely to be misinterpreted. The
terms "state-equation error" and "observation-equation error," which we use in the following sections, are more
definitive, but not widely used.

The equation-error method is also referred to by several other names. The term "least squares" is sometimes
used to define the method, but this terminology is subject to misinterpretation. The large majority of
the estimation methods used can be classified as least-squares methods. We suggest using the term "least
squares" only to refer to this broad class of methods (as in the statement "equation error is a least squares
method"), never to precisely specify a method. The term "linear least squares" is somewhat more definitive (at
least for the case in which f is a linear function of ξ) and has been used on occasion. Another term often
used is "regression" method (or, more definitively, "linear regression").

The terms "equation error" or "least squares" are often used to contrast this method with maximum-likelihood
estimators. Such contrasts are inappropriate and misleading because equation error is a completely
rigorous maximum-likelihood estimator for the problem as stated. The differences between equation error,
output error, and filter error lie in the problem statements and assumptions, not in the statistical principles
used nor in the rigor of the derivation. To disparage equation error on the basis that it is not maximum
likelihood because it ignores measurement noise smacks more of snobbery than of honest evaluation. The neglect
of measurement noise may, indeed, be a significant flaw for some applications, but this flaw is irrelevant to
the issue of whether equation error is maximum likelihood.

A related common misconception is that equation-error estimates are biased, whereas output-error or
filter-error estimates are asymptotically unbiased. To the contrary, equation error is asymptotically unbiased
for the problem as stated; in many applications, the equation-error estimates are even unbiased for finite
time. It is true that equation error is biased in the presence of measurement noise, but output error is
likewise biased in the presence of process noise.

The principle illustrated here is universal: any estimator is biased (among other problems) when applied
to systems that violate the assumptions used in deriving the estimator. This principle applies to all
assumptions, not just to the presence or absence of noise. Because any real system will violate any tractable
set of assumptions, all estimators are actually biased. (All of our previous statements that given estimators
are unbiased are based on idealized systems meeting the stated assumptions.)

The unqualified statement that a given estimator is biased is, therefore, of little use in evaluating the
estimator. More pertinent issues include the questions of which assumptions are most severely violated by the
actual system, and how sensitive the estimator is to these violations. The magnitude of the bias is a
reasonable means of addressing these questions, but the mere existence of a bias is not.


10.2  GENERAL  EQUATION  ERROR  FORM 

Many practical applications of equation error do not fit naturally into the restrictive definition of the
previous section, which allows no measurement noise. There are several alternate definitions of equation error
that accommodate these applications. These alternate definitions involve apparently disparate statistical
assumptions. The unifying theme, which justifies the use of the same terminology and computational tools for
these various cases, is the form of the resulting cost function. In some cases, two different viewpoints and
corresponding different assumptions about the same application can result in identical computations.


We will, therefore, take the cost-function form as the general defining property of equation-error
estimators. This form can arise from several different sets of statistical assumptions. This nonstatistical,
result-oriented approach to the definition helps us to avoid unnaturally contorting some problem statements
to force them to fit an overly rigid definition, when a more natural problem statement achieves the same
result.

To define a general equation-error estimator, we start with some equation, expressed as a function of the
measurements and the unknown parameters, which should ideally (ignoring noise and modeling errors) be satisfied
at every measurement time point. We write the equation in the general form

$$h[z(\cdot), u(\cdot), t_i, \xi] = 0 \qquad i = 1, 2, \ldots, N \tag{10.2-1}$$

Sections 10.2.1 through 10.2.3 give specific common cases of such equations.

The equation-error estimate based on this equation is then the value of ξ that minimizes the cost
function

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} h[z(\cdot), u(\cdot), t_i, \xi]^* \, W \, h[z(\cdot), u(\cdot), t_i, \xi] \tag{10.2-2}$$

where W is a positive semidefinite weighting matrix. The definition assumes that the minimum exists and is
unique.

In order to accommodate prior information, and unknown W matrices, we allow the form of Equation (10.2-2)
to be extended to

$$J(\xi) = \frac{1}{2} \sum_{i=1}^{N} h[z(\cdot), u(\cdot), t_i, \xi]^* \, W \, h[z(\cdot), u(\cdot), t_i, \xi] + \frac{1}{2} N \ln|W^{-1}| + \frac{1}{2} (\xi - m_\xi)^* P^{-1} (\xi - m_\xi) \tag{10.2-3}$$

corresponding to Equation (10.1-12). The above definition is broad enough to include output error and filter
error, as well as the equation-error estimators defined in Section 10.1.

The estimators emphasized in this chapter have the particular property that the h dependence on z(·)
and u(·) is restricted to one or two time points. The central statistical assumption that gives this property
is that there are perfect (no noise) measurements of the state. This assumption reduces the Kalman filter to
the form of Equation (10.1-3), which eliminates the integration of the state equation. With this assumption,
Equation (10.1-3) is the obvious optimal filter even for nonlinear state equations. We are also forced to
assume that the process noise covariance FF* is nonsingular; a singular FF* combined with the perfect
state measurements would give an ill-posed problem.

10.2.1  Discrete  State-Equation  Error 

One specific case of the equation-error method is state-equation error. In this case, the specific form
of Equation (10.2-1) derives from the state equation, ignoring the process noise. We will first consider
state-equation error for discrete-time systems. The discrete-time state equation for a general nonlinear
system, ignoring the process noise, is

$$x(t_{i+1}) = f[x(t_i), u(t_i), \xi] \qquad i = 0, 1, \ldots, N - 1 \tag{10.2-4}$$

The h function based on this equation is

$$h[z(\cdot), u(\cdot), t_i, \xi] = x(t_i) - f[x(t_{i-1}), u(t_{i-1}), \xi] \qquad i = 1, 2, \ldots, N \tag{10.2-5}$$

This form presumes that the x(t_i) can be reconstructed as a function of the z(t_i) and u(t_i).

We recognize discrete-time state-equation error as the method derived in Section 10.1. Equation (10.1-12)
(with Equation (10.1-10)) is a special case of Equation (10.2-3) using Equation (10.2-5) for h and (FF*)^{-1}
for W. Section 10.1 discussed the details of the statistical assumptions implicit in this form.

Note also that we can define a state-equation error method whether or not the state measurements are
noise-free. The only requisite for a plausible state-equation error method is that we have some estimate of
the state to use in Equation (10.2-5). If the measurements are contaminated with noise, then the estimator is
not a maximum-likelihood estimator and will be asymptotically biased. There are many practical circumstances,
however, where a simple equation-error estimator is preferable to the "optimal" alternatives.
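
As a hedged illustration of discrete-time state-equation error, the sketch below assumes a linear state
function f = Φx + Ψu in which every element of Φ and Ψ is unknown, noise-free state measurements, and an
identity weighting. The function names, the system matrices, and the noise level are illustrative
assumptions, not values from the text.

```python
import numpy as np

def simulate(phi, psi, u, x0, noise_std, rng):
    """Simulate x(t_{i+1}) = phi x(t_i) + psi u(t_i) + process noise."""
    n = len(u)
    x = np.zeros((n + 1, len(x0)))
    x[0] = x0
    for i in range(n):
        x[i + 1] = phi @ x[i] + psi @ u[i] + noise_std * rng.standard_normal(len(x0))
    return x

def state_equation_error(x, u):
    """Minimize sum_i |x(t_i) - phi x(t_{i-1}) - psi u(t_{i-1})|^2 over phi and psi."""
    regressors = np.hstack([x[:-1], u])      # rows: [x(t_{i-1}), u(t_{i-1})]
    theta, *_ = np.linalg.lstsq(regressors, x[1:], rcond=None)
    n_x = x.shape[1]
    return theta[:n_x].T, theta[n_x:].T      # phi_hat, psi_hat

rng = np.random.default_rng(0)
phi_true = np.array([[0.9, 0.1], [-0.2, 0.8]])   # assumed example system
psi_true = np.array([[0.0], [0.5]])
u = rng.standard_normal((2000, 1))
x = simulate(phi_true, psi_true, u, np.zeros(2), 0.05, rng)
phi_hat, psi_hat = state_equation_error(x, u)
print(phi_hat)
print(psi_hat)
```

Because the states are used directly, no integration of the state equation is needed; the whole estimate is
one linear least-squares solve.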

10.2.2  Continuous/Discrete  State-Equation  Error 

For a mixed continuous/discrete-time system with additive process noise, the state equation is

$$\dot{x}(t) = f[x(t), u(t), \xi] + F n(t) \tag{10.2-6}$$

The h function for a continuous/discrete-time state-equation error method derives from evaluating the state
equation at the measurement times t_i and ignoring the process noise:

$$h[z(\cdot), u(\cdot), t_i, \xi] = \dot{x}(t_i) - f[x(t_i), u(t_i), \xi] \tag{10.2-7}$$

The use of this form in an equation-error method presumes that the state x(t_i) can be reconstructed as a
function of the z(t_i) and u(t_i). This presumption is identical to that for discrete-time state-equation
error, and it implies the same conditions: there must be noise-free measurements of the state, independent of
ξ. It is implicit that a known invertible transformation of such measurements is statistically equivalent. As
in the discrete-time case, we can define the estimator even when the measurements are noisy, but it will no
longer be a maximum-likelihood estimator.

Equation (10.2-7) also presumes that the derivative ẋ(t_i) can be reconstructed from the measurements.
Neglecting for the moment the statistical implications, note that we can form a plausible equation-error
estimator using any reasonable means of approximating a value for ẋ(t_i) independently of ξ. The simplest case
of this is when the observation vector includes measurements of the state derivatives in addition to the
measurements of the states. If such derivative measurements are not directly available, we can always
approximate ẋ(t_i) by finite-difference differentiation of the state measurements, as in

$$\dot{x}(t_i) \approx \frac{x(t_{i+1}) - x(t_{i-1})}{t_{i+1} - t_{i-1}} \tag{10.2-8}$$

Both direct measurement and finite-difference approximation are used in practice.

Rigorous statistical treatment is easiest for the case of finite-difference approximations. To arrive at
such a form, we write the state equation in integrated form as

$$x(t_{i+1}) = x(t_i) + \int_{t_i}^{t_{i+1}} f[x(t), u(t), \xi]\,dt + \int_{t_i}^{t_{i+1}} F n(t)\,dt \tag{10.2-9}$$

An approximate solution (not necessarily the best approximation) to Equation (10.2-9) is

$$x(t_{i+1}) \approx x(t_i) + (t_{i+1} - t_i) f[x(t_i), u(t_i), \xi] + F_d n_i \tag{10.2-10}$$

where n_i is a sequence of independent Gaussian variables, and F_d is the equivalent discrete F-matrix.
Sections 6.2 and 7.5 discuss such approximations.

Equation (10.2-10) is in the form of a discrete-time state equation. The discrete-time state-equation
error method based on this equation uses

$$h[z(\cdot), u(\cdot), t_i, \xi] = x(t_i) - x(t_{i-1}) - (t_i - t_{i-1}) f[x(t_{i-1}), u(t_{i-1}), \xi] \tag{10.2-11}$$

Redefining h by dividing by t_i - t_{i-1} gives the form

$$h[z(\cdot), u(\cdot), t_i, \xi] = \dot{x}(t_i) - f[x(t_{i-1}), u(t_{i-1}), \xi] \tag{10.2-12}$$

where the derivative is obtained from the finite-difference formula

$$\dot{x}(t_i) = \frac{x(t_i) - x(t_{i-1})}{t_i - t_{i-1}} \tag{10.2-13}$$


Other discrete-time approximations of Equation (10.2-9) result in different finite-difference formulae.
The central-difference form of Equation (10.2-8) is usually better than the one-sided form of
Equation (10.2-13), although Equation (10.2-8) has a lower bandwidth. If the bandwidth of Equation (10.2-8)
presents problems, a better approach than Equation (10.2-13) is to use

$$h[z(\cdot), u(\cdot), t_i, \xi] = \dot{x}(t_{i-1/2}) - f[x(t_{i-1/2}), u(t_{i-1/2}), \xi] \tag{10.2-14}$$

where we have used the notation

$$t_{i-1/2} = \frac{1}{2}(t_i + t_{i-1}) \tag{10.2-15}$$

and

$$\dot{x}(t_{i-1/2}) = \frac{x(t_i) - x(t_{i-1})}{t_i - t_{i-1}} \tag{10.2-16}$$

There are several other reasonable finite-difference formulae applicable to this problem.
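
The difference quotients discussed above can be compared on a signal with a known derivative. The following
is a minimal sketch, assuming a scalar state sampled uniformly and the test signal x(t) = t²; the function
names are illustrative. It treats the same one-sided quotient either as the derivative at t_i or as the
derivative at the midpoint time t_{i-1/2}.

```python
import numpy as np

def central_difference(t, x):
    """Central quotient (x_{i+1} - x_{i-1}) / (t_{i+1} - t_{i-1}) at interior points."""
    return (x[2:] - x[:-2]) / (t[2:] - t[:-2])

def one_sided_difference(t, x):
    """One-sided quotient (x_i - x_{i-1}) / (t_i - t_{i-1}) for i = 1 .. N."""
    return (x[1:] - x[:-1]) / (t[1:] - t[:-1])

def midpoint_times(t):
    """Midpoint times t_{i-1/2} = (t_i + t_{i-1}) / 2."""
    return 0.5 * (t[1:] + t[:-1])

t = np.linspace(0.0, 1.0, 11)     # uniform sampling, dt = 0.1
x = t ** 2                        # test signal with exact derivative 2t

# Largest error of each formula against the exact derivative at the time it is assigned to.
err_central = np.max(np.abs(central_difference(t, x) - 2 * t[1:-1]))
err_one_sided = np.max(np.abs(one_sided_difference(t, x) - 2 * t[1:]))
err_midpoint = np.max(np.abs(one_sided_difference(t, x) - 2 * midpoint_times(t)))
print(err_central, err_one_sided, err_midpoint)
```

For this quadratic signal the central and midpoint interpretations are exact up to rounding, while assigning
the one-sided quotient to t_i is off by the sample interval. Real data with noise and nonuniform sampling
will behave less cleanly.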

Rigorous statistical treatment of the case in which direct state derivative measurements are available
raises several complications. Furthermore, it is difficult to get a rigorous result in the form typically
used: an equation-error method based on ẋ measurements substituted into Equation (10.2-7). It is probably
best to regard this approach as an equation-error estimator derived from plausible, but ad hoc, reasoning.

We will briefly outline the statistical issues raised by state derivative measurements, without attempting
a complete analysis. The first problem is that, for systems with white process noise, the state derivative is
infinite at every point in time. (Careful argument is required even to define the derivative.) We could avoid
this problem by requiring the process noise to be band-limited, or by other means, but the resulting estimator
will not be in the desired form. A heuristic explanation is that the x measurements contain implicit
information about the derivative (from the finite differences), and simple use of the measured derivative
ignores this information. A rigorous maximum-likelihood estimator would use both sources of information.

This statement assumes that the ẋ measurements and the finite-difference derivatives are independent data.

It is conceivable that the x "measurements" are obtained as sums of the ẋ measurements (for instance, in
an inertial navigation unit). Such cases are merely integrated versions of the finite-difference approach, not
really comparable to cases of independent x measurements.

The lack of a rigorous derivation for the state-equation error method with independently measured state
derivatives does not necessarily mean that it is a poor estimator. If the information in the state derivative
measurements is much better than the information in the finite-difference state derivatives, we can justify
the approach as a good approximation. Furthermore, as expressed in our discussions in Section 1.4, an
estimator does not have to be statistically derived to be a good estimator. For some problems, this estimator
gives adequate results with low computational costs; when this result occurs, it is sufficient justification
in itself.

10.2.3  Observation-Equation  Error 

Another specific case of the equation-error method is observation-equation error. In this case, the
specific form of h comes from the observation equation, ignoring the noise. The equation is the same for
pure discrete-time or mixed continuous/discrete-time systems. The observation equation for a system with
additive noise is

$$z(t_i) = g[x(t_i), u(t_i), \xi] + G n_i \tag{10.2-17}$$

The h function based on this equation is

$$h[z(\cdot), u(\cdot), t_i, \xi] = z(t_i) - g[x(t_i), u(t_i), \xi] \tag{10.2-18}$$

As in the case of state-equation error, observation-equation error requires measurements or
reconstructions of the state, because x(t_i) appears in the equation. The comments in Section 10.2.1 about
noise in the state measurement apply here also. Observation-equation error does not require measurements of
the state derivative.

The observation-equation error method also requires that there be some measurements in addition to the
states, or the method reduces to triviality. If the states were the only measurements, the observation
equation would reduce to

$$z(t_i) = x(t_i) \tag{10.2-19}$$

which has no unknown parameters. There would, therefore, be nothing to estimate.

The observation-equation error method applies only to estimating parameters in the observation equation.
Unknown parameters in the state equation do not enter this formulation. In fact, the existence of the state
equation is largely irrelevant to the method.

This irrelevance perhaps explains why observation-equation error is usually neglected in discussions of
estimators for dynamic systems. The method is essentially a direct application of the static estimators of
Chapter 5, taking no advantage of the dynamics of the system (the state equation). From a theoretical
viewpoint, it may seem out of place in this chapter.

In practice, the observation-equation-error method is widely used, sometimes contorted to look like a
state-equation-error method. The observation-equation-error method is often a competitor to an output-error
method. Our treatment of observation-equation error is intended to facilitate a fair evaluation of such
choices and to avoid unnecessary contortions into state-equation error forms.


10.3  COMPUTATION 

We have previously mentioned that a unifying characteristic of the methods discussed in this chapter is
their computational simplicity. We have not, however, given much detail on the computational issues.

Equation (10.2-3), which encompasses all equation-error forms, is in the form of Equation (2.5-1) if the
weighting matrix W is known. Therefore, the Gauss-Newton optimization algorithm applies directly. Unknown
W-matrices can be handled by the method discussed in Sections 5.5 and 8.4.

In the most general definition of equation error, this is nearly the limit of what we can state about
computation. The definition of Equation (10.2-3) is general enough to allow output error and filter error as
special cases. Both output error and filter error have the special property that the dependence of h on z
and u can be cast in a recursive form, significantly lowering the computational costs. Because of this
recursive form, the total computational cost is roughly proportional to the number of time points, N. The
general definition of equation error also encompasses nonrecursive forms, which could have computational costs
proportional to N² or higher powers.

The equation-error methods discussed in this chapter have the property that, for each t_i, the dependence
of h on z(·) and u(·) is restricted to one or two time points. Therefore, the computational effort for each
evaluation of h is independent of N, and the total computational cost is roughly proportional to N. In
this regard, state-equation error and output-equation error are comparable to output error and filter error.

For a completely general, nonlinear system, the computational cost of state-equation error or output-equation
error is roughly similar to the cost of output error. (General nonlinear models are currently impractical for
filter error without using linearized approximations.)

In the large majority of practical applications, however, the f and g functions have special properties
which make the computational costs of state-equation error and output-equation error far smaller than the
computational costs of output error or filter error.

The first property is that the f and g functions are linear in ξ. This property holds true even for
systems described as nonlinear; the nonlinearity meant by the term "nonlinear system" is as a function of x
and u, not as a function of ξ. Equation (1.3-2) is a simple example of a static system nonlinear in the
input, but linear in the parameters. The output-error method can seldom take advantage of linearity in the
parameters, even when the system is also linear in x and u, because the system response is usually a nonlinear
function of ξ. (There are some significant exceptions in special cases.)

State-equation error and output-equation error methods, in contrast, can take excellent advantage of
linearity in the parameters, even when the system is nonlinear in x and u. In this situation, state-equation
error and output-equation error meet the conditions of Section 2.5.1 for the Gauss-Newton algorithm to attain
the exact minimum in a single iteration.

This is both a quantitative and a qualitative computational improvement relative to output error. The
quantitative improvement is a division of the computational cost by the number of iterations required for the
output-error method. The qualitative improvement is the elimination of the issues associated with iterative
methods: starting values, convergence-testing criteria, failure to converge, convergence accelerators,
multiple local solutions, and other issues. The most commonly cited of these benefits is that there is no need
for reasonable starting values. You can evaluate the equations at any arbitrary point (zero is often
convenient) without affecting the result.
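
The single-iteration property can be checked numerically. The sketch below is an assumption-laden
illustration, not the authors' program: it takes a scalar state equation ẋ = ξ₁x + ξ₂x³ + ξ₃u (nonlinear in
x, linear in ξ), unit weighting, and made-up parameter values, and applies one Gauss-Newton step from two
very different starting points.

```python
import numpy as np

def gauss_newton_step(xi0, regressors, target):
    """One Gauss-Newton step for J = 1/2 sum h_i^2 with h_i = target_i - regressors_i @ xi."""
    residual = target - regressors @ xi0
    gradient = -regressors.T @ residual       # first gradient of J at xi0
    hessian = regressors.T @ regressors       # Gauss-Newton second-gradient approximation
    return xi0 - np.linalg.solve(hessian, gradient)

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 500)               # noise-free state samples (assumed)
u = rng.uniform(-1.0, 1.0, 500)               # input samples (assumed)
xi_true = np.array([-0.5, -2.0, 1.0])         # illustrative parameter values
regressors = np.column_stack([x, x ** 3, u])  # nonlinear in x, linear in xi
xdot = regressors @ xi_true + 0.01 * rng.standard_normal(500)

xi_from_zero = gauss_newton_step(np.zeros(3), regressors, xdot)
xi_from_far = gauss_newton_step(np.array([100.0, -50.0, 7.0]), regressors, xdot)
xi_second = gauss_newton_step(xi_from_zero, regressors, xdot)
print(xi_from_zero, xi_from_far, xi_second)
```

Both starting points land on the same estimate, and a second step does not move it, which is what exact
convergence in one iteration means for this cost function.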

Another simplifying property of f and g, not quite as universal, but true in the majority of cases, is
that each element of ξ affects only one element of f or g. The simplest example of this is a linear system
where the unknown parameters are individual elements of the system matrices. With this structure, if we
constrain W to be diagonal, Equation (10.2-3) separates into a sum of independent minimization problems with
scalar h, one problem for each element of h. If ℓ is the number of elements of the h-vector, we now have
ℓ independent functions in the form of Equation (10.2-3), each with scalar h. Each element of ξ affects
one and only one of these scalar functions.

This partitioning has the obvious benefit, common to most partitioning algorithms, that the sum of the
ℓ problems with scalar h requires less computation than the unpartitioned vector problem. The outer-product
computation of Equation (2.5-11), often the most time-consuming part of the algorithm, is proportional to the
square of the number of unknowns and to ℓ. Therefore, if the unknowns are evenly distributed among the ℓ
elements of h, the computational cost of the vector problem could be as much as ℓ² times the cost of each of
the scalar problems. Other portions of the computational cost and overhead will reduce this factor somewhat,
but the improvement is still dramatic.

Another benefit of the partitioning is that it allows us to avoid iteration when the noise covariances are
unknown. With this partitioning, the minimizing values of ξ are independent of W. The normal role of W
is in weighing the importance of fitting the different elements of h. One value of ξ might fit one
element of h best, while another value of ξ fits another element of h best; W establishes how to strike
a compromise among these conflicting aims. Since the partitioned problem structure makes the different
elements of h independent, W is largely irrelevant. Therefore we can estimate the elements of ξ using any
arbitrary value of W (usually an identity matrix). If we want an estimate of W, we can compute it after we
estimate the other unknowns.
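
A small numerical check of this independence from W is sketched below. It assumes h_i = y_i - Θ r_i, where
each row of the unknown matrix Θ affects only one element of h, and a diagonal weighting applied as
h* W h. The names `vector_estimate` and `scalar_estimates` and all numerical values are hypothetical.

```python
import numpy as np

def vector_estimate(r, y, w):
    """Minimize sum_i h_i^T W h_i with h_i = y_i - Theta r_i, solved as one vector problem."""
    n_out, n_reg = y.shape[1], r.shape[1]
    lhs = np.zeros((n_out * n_reg, n_out * n_reg))
    rhs = np.zeros(n_out * n_reg)
    for r_i, y_i in zip(r, y):
        # Jacobian of Theta r_i with respect to Theta stacked row by row.
        jac = np.kron(np.eye(n_out), r_i[None, :])
        lhs += jac.T @ w @ jac
        rhs += jac.T @ w @ y_i
    return np.linalg.solve(lhs, rhs).reshape(n_out, n_reg)

def scalar_estimates(r, y):
    """Solve one independent least-squares problem per element of h; no W is needed."""
    theta, *_ = np.linalg.lstsq(r, y, rcond=None)
    return theta.T

rng = np.random.default_rng(3)
theta_true = np.array([[1.0, -2.0, 0.5], [0.3, 0.0, 4.0]])   # illustrative values
r = rng.standard_normal((200, 3))
y = r @ theta_true.T + rng.standard_normal((200, 2)) * np.array([0.1, 1.0])

theta_weighted = vector_estimate(r, y, np.diag([1.0, 100.0]))
theta_identity = vector_estimate(r, y, np.eye(2))
theta_split = scalar_estimates(r, y)
print(theta_weighted)
print(theta_split)
```

The weighted vector solution, the identity-weighted solution, and the separate scalar solutions agree to
rounding error, while the scalar route works with much smaller matrices.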

The combined effect of these computational improvements is to make the computational cost of the
state-equation error and output-equation error methods negligible in many applications. It is common for the
computational cost of the actual equation-error algorithm to be dwarfed by the overhead costs of obtaining
the data, plotting the results, and related computations.


10.4  DISCUSSION 

The undebated strong points of the state-equation-error and output-equation-error methods are their
simplicity and low computational cost. Most important is that Gauss-Newton gives the exact minimum of the cost
function without iteration. Because the methods are noniterative, they require no starting estimates. These
methods have been used in many applications, sometimes under different names.

The weaknesses of these methods stem from their assumptions of perfect state measurements. Relatively
small amounts of noise in the measurements can cause significant bias errors in the estimates. If a
measurement of some state is unavailable, or if an instrument fails, these methods are not directly applicable
(though such problems are sometimes handled by state reconstruction algorithms).
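
The bias caused by measurement noise can be seen in a simple simulation. The sketch below is an illustrative
assumption rather than an example from the text: a scalar system x(t_{i+1}) = a x(t_i) + n_i with unit-variance
process noise, estimated once from the true states and once from states corrupted by unit-variance white
measurement noise. For this particular scalar case the large-sample value of the noisy estimate is
a σ_x² / (σ_x² + σ_v²), which the code evaluates for comparison.

```python
import numpy as np

def simulate_states(a, n_points, rng):
    """Simulate x(t_{i+1}) = a x(t_i) + n_i with unit-variance process noise."""
    noise = rng.standard_normal(n_points)
    x = np.zeros(n_points)
    for i in range(n_points - 1):
        x[i + 1] = a * x[i] + noise[i]
    return x

def equation_error_estimate(z):
    """Least-squares fit of z(t_{i+1}) = a z(t_i), treating z as a perfect state measurement."""
    return float(z[1:] @ z[:-1] / (z[:-1] @ z[:-1]))

rng = np.random.default_rng(2)
a_true = 0.9                                        # illustrative value
x = simulate_states(a_true, 100_000, rng)
a_clean = equation_error_estimate(x)                # noise-free state measurements
a_noisy = equation_error_estimate(x + rng.standard_normal(x.size))

state_var = 1.0 / (1.0 - a_true ** 2)               # stationary variance of x
a_limit = a_true * state_var / (state_var + 1.0)    # large-sample value with noise
print(a_clean, a_noisy, a_limit)
```

With perfect state measurements the estimate sits close to the true value; with the added measurement noise
it settles near the smaller limiting value instead, and more data does not remove the offset.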

State-equation-error and output-equation-error methods can be used with either of two distinct approaches,
depending upon the application. The first approach is to accept the problem of measurement-noise sensitivity
and to emphasize the computational efficiency of the method. This approach is appropriate when computational
cost is a more important consideration than accuracy.

For example, state-equation error and output-equation error methods are popular for obtaining starting
values for iterative procedures such as output error. In such applications, the estimates need only be
accurate enough to cause the iterative methods to converge (presumably to better estimates).

Another common use for state-equation error and output-equation error is to select a model from a large
number of candidates by estimating the parameters in each candidate model. Once the model form is selected,
the rough parameter estimates can be refined by some other method.


The second approach to using state-equation-error or output-equation-error methods is to spend the time
and effort necessary to get accurate results from them, which first requires accurate state measurements with
low noise levels. In many applications of these methods, most of the work lies in filtering the data and
reconstructing estimates of unmeasured states. (A Kalman filter can sometimes be helpful here, provided that
the filter does not depend upon the parameters to be estimated. This condition requires a special problem
structure.) The total cost of obtaining good estimates from these methods, including the cost of data
preprocessing, may be comparable to the cost of more complicated iterative algorithms that require less
preprocessing. The trade-off is highly dependent on application variables such as the required accuracy of
the estimates, the quality of the available instrumentation, and the existence of independent needs for
accurate state measurements.

CHAPTER 11


11.0  ACCURACY  OF  THE  ESTIMATES 

Parameter estimates from real systems are, by their nature, imperfect. The accuracy of the estimates is
a pervasive issue in the various stages of application, from the problem statement to the evaluation and use
of the results.

We introduced the subject of parameter estimation in Section 1.4, using concepts of errors in the
estimates and adequacy of the results. The subsequent chapters have largely concentrated on the derivation of
algorithms. These derivations are all related to accuracy issues, based on the definitions and discussions in
Chapter 4. However, the questions about accuracy have been largely overshadowed by the details of deriving and
implementing the algorithms.

In this chapter, we return the emphasis to the critical issue of accuracy. The final judgment of the
parameter estimation process for a particular application is based on the accuracy of the results. We examine
the evaluation of the accuracy, factors contributing to inaccuracy, and means of improving accuracy. A truly
comprehensive treatment of the subject of accuracy is impossible. We restrict our discussion largely to
generic issues related to the theory and methodology of parameter estimation.

To make effective use of parameter estimates, we must have some gauge of their accuracy, be it a
statistical measure, an intuitive guess, or some other source. If we absolutely cannot distinguish the
extremes of accurate versus worthless estimates, we must always consider the possibility that the estimates
are worthless, in which case the estimates could not be used in any application in which their validity was
important. Therefore, measures of the estimate accuracy are as important as are the estimates themselves.
Various means of judging the accuracy of parameter estimates are in current use.

We will group the uses for measures of estimate accuracy into three general classes. The first class of
use is in planning the parameter estimation. Predictions of the estimate accuracy can be used to evaluate the
adequacy of the proposed experiments and instrumentation system for the parameter estimation on the proposed
model. There are limitations to this usage because it involves predicting accuracy before the actual data are
obtained. Unexpected problems can always cause degradation of the results compared to the predictions. The
accuracy predictions are most useful in identifying experiments that have no hope of success.

The second use is in the parameter estimation process itself. Measures of accuracy can help detect
various problems in the estimation, from modeling failures, data problems, program bugs, or other sources.
Another facet of this class of use is the comparison of different estimates. The comparisons can be between
two different models or methods applied to the same data set, between estimates from independent data sets, or
between predictions and estimates from the experimental data. In any of these events, measures of accuracy
can help determine which of the conflicting values is best, or whether some compromise between them should be
considered. Comparison of the accuracy measures with the differences in the estimates is a means to determine
if the differences are significant. The magnitude of the observed differences between the estimates is, in
itself, an indicator of accuracy.

The third use of measures of accuracy is for presentation with the final estimates for the user of the
results. If the estimates are to be used in a control system design, for instance, knowledge of their accuracy
is useful in evaluating the sensitivity of the control system. If the estimates are to be used by an explicit
adaptive or learning control system, then it is important that the accuracy evaluation be systematic enough to
be automatically implemented. Such immediate use of the estimates precludes the intercession of engineering
judgment: the evaluation of the estimates must be entirely automatic. Such control systems must recognize poor
results and suitably discount them (or ensure that they never occur, an overly optimistic goal).

The single most critical contributor to getting accurate parameter estimates in practical problems is the
analyst's understanding of the physical system and instrumentation. The most thorough knowledge of parameter
estimation theory and the use of the most powerful techniques do not compensate for poor understanding of
the system. This statement relates directly to the discussion in Chapter 1 about the "black box" identification
problem and the roles of independent knowledge versus system identification. The principles discussed in
this chapter, although no substitute for an understanding of the system, are a necessary adjunct to such
understanding.

Before proceeding further, we need to review the definition of the term "accuracy" as it applies to real
data. A system is never described exactly by the simplified models used for analysis. Regardless of the
sophistication of the model, unexplained sources of modeling error will always remain. There is no unique,
correct model.

The concept of accuracy is difficult to define precisely if no correct model exists. It is easiest to
approach by considering the problem in two parts: estimation and modeling. For analyzing the estimation
problem, we assume that the model describes the system exactly. The definition of accuracy is then precise and
quantitative. Many results are available in the subject area of estimation accuracy. Sections 11.1 and 11.2
discuss several of them.

The modeling problem addresses the question of whether the form of the model can describe the system
adequately for its intended use. There is little guidance from the theory in this area. Studies such as those
of Gupta, Hall, and Trankle (1978), Fiske and Price (1977), and Akaike (1974) discuss selection of the best
model from a set of candidates, but do not consider the more basic issue of defining the candidate models.
Section 11.4 considers this point in more detail.

For the most part, the determination of model adequacy is based on engineering judgment and problem-specific
analysis relying heavily on the analyst's understanding of the physics of the system. In some cases,
we can test model adequacy by demonstration: if we try the model and it achieves its purpose, it was obviously
adequate. Such tests are not always practical, however. This method assumes, of course, that the test was
comprehensive. Such assumptions should not be made lightly; they have cost lives when systems encountered
untested conditions.

After considering estimation and modeling as separate problems, we need to look at their interactions to
complete the discussion of accuracy. We need to consider the estimates that result from a model judged to be
adequate, although not exact. As in the modeling problem, this process involves considerable subjective
judgment, although we can obtain some quantitative results.

We can examine some specific, postulated sources of modeling error through simulations or analyses that
use more complex models than are practical or desirable in the parameter estimation. Such simulations or
analyses can include, for example, models of specific, postulated instrumentation errors (Hodge and Bryant,
1978; and Sorensen, 1972). Maine and Iliff (1981b) present some more general, but less rigorous, results.


11.1  CONFIDENCE  REGIONS 


The concept of a confidence region is central to the analytical study of estimation accuracy. In general
terms, a confidence region is a region within which we can be reasonably confident that the true value of $\xi$
lies. Accurate estimates correspond to small confidence regions for a given level of confidence. Note that
small confidence regions imply large confidence; in order to avoid this apparent inversion of terminology, the
term "uncertainty region" is sometimes used in place of the term "confidence region." The following
subsections define confidence regions more precisely.


For continuous, nonsingular estimation problems, the probability of any point estimate's being exactly
correct is zero. We need a concept such as the confidence region to make statements with a nonzero confidence.
Throughout the discussion of confidence regions, we assume that the system model is correct; that is, we assume
that $\xi$ has a true value lying in the parameter space. In later sections we will consider issues relating to
modeling error.

11.1.1  Random  Parameter  Vector 


Let us consider first the case in which $\xi$ is a random variable with a known prior distribution. This
situation usually implies the use of an MAP estimator.

In this case, $\xi$ has a posterior distribution, and we can define the posterior probability that $\xi$ lies
in any fixed region. Although we will use the posterior distribution of $\xi$ as the context for this
discussion, we can equally well define prior confidence regions. None of the following development depends upon our
working with a posterior distribution. For simplicity of exposition, we will assume that the posterior
distribution of $\xi$ has a density function. The posterior probability that $\xi$ lies in a region $R$ is then

$P(R) = \int_R p(\xi|Z) \, d|\xi|$  (11.1-1)

We define $R$ to be a confidence region for the confidence level $\alpha$ if $P(R) = \alpha$, and no other region
with the same probability is smaller than $R$. We use the volume of a region as a measure of its size.

Theorem 11.1 Let $R$ be the set of all points with $p(\xi|Z) \ge c$, where $c$ is
a constant. Then $R$ is a confidence region for the confidence level
$\alpha = P(R)$.

Proof Let $R$ be as defined above, and let $R'$ be any other region with
$P(R') = \alpha$. We need to prove that the volume of $R'$ must be greater than or
equal to that of $R$. We define $T = R \cap R'$, $S = R \cap \bar{R}'$, and $S' = R' \cap \bar{R}$,
where the overbar denotes the complement of a region.
Then $T$, $S$, and $S'$ are disjoint, $R = T \cup S$, and $R' = T \cup S'$. Because
$S \subset R$, we must have $p(\xi|Z) \ge c$ everywhere in $S$. Conversely, $S' \subset \bar{R}$, so
$p(\xi|Z) < c$ everywhere in $S'$. In order for $P(R') = P(R)$, we must have
$P(S') = P(S)$. Because the density is at least $c$ throughout $S$ and less than $c$
throughout $S'$, the volume of $S'$ must be greater than or equal to
that of $S$. The volume of $R'$ must then be greater than or equal to that of $R$,
completing the proof.

It is often convenient to characterize a closed region by its boundary. The boundaries of the confidence
regions defined by Theorem 11.1 are isoclines of the posterior density function $p(\xi|Z)$.

We can write the confidence region derived in the above theorem as

$R = \{x: p_{\xi|Z}(x|Z) \ge c\}$  (11.1-2)

We must use the full notation for the probability density function to avoid confusion in the following
manipulations. For consistency with the following section, it is convenient to re-express the confidence region in
terms of the density function of the error.

$e = \xi - \hat{\xi}$  (11.1-3)

The estimate $\hat{\xi}$ is a deterministic function of $Z$; therefore, Equation (11.1-3) trivially gives

$p_{\xi|Z}(x|Z) = p_{e|Z}(x - \hat{\xi}|Z)$  (11.1-4)
Substituting this into Equation (11.1-2) gives the expression

$R = \{x: p_{e|Z}(x - \hat{\xi}|Z) \ge c\}$  (11.1-5)

Substituting $\hat{\xi} + x$ for $x$ in Equation (11.1-5) gives the convenient form

$R = \{\hat{\xi} + x: p_{e|Z}(x|Z) \ge c\}$  (11.1-6)

This form shows the boundaries of the confidence regions to be translated isoclines of the error-density
function.

Exact determination of the confidence regions is impractical except in simple cases. One such case occurs
when $\xi$ is scalar and $p(\xi|Z)$ is unimodal. An isocline then consists of two points, and the line segment
between the two points is the confidence region. In this one-dimensional case, the confidence region is often
called a confidence interval.
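
The scalar case can be sketched numerically. The following is a minimal illustration, not a method taken from
this text: it assumes a hypothetical skewed, unimodal posterior density tabulated on a uniform grid, and it
builds the interval by admitting grid cells in order of decreasing density, which is the discrete analogue of
the set of points with density greater than or equal to $c$ in Theorem 11.1. The function name and the
particular density are illustrative assumptions.

```python
import math

def highest_density_region(density, cell_width, level):
    """Return (cells, probability) for the smallest set of grid cells whose
    total probability reaches `level`. Cells are admitted in order of
    decreasing density, the discrete analogue of {x: p(x) >= c}."""
    order = sorted(range(len(density)), key=lambda i: density[i], reverse=True)
    chosen, prob = [], 0.0
    for i in order:
        if prob >= level:
            break
        chosen.append(i)
        prob += density[i] * cell_width
    return sorted(chosen), prob

# Hypothetical unimodal, skewed density proportional to x^2 exp(-x).
width = 0.01
grid = [width * (k + 0.5) for k in range(2000)]   # cell centres on (0, 20)
raw = [x ** 2 * math.exp(-x) for x in grid]
total = sum(raw) * width
density = [r / total for r in raw]                # normalise to unit area

cells, prob = highest_density_region(density, width, 0.90)
lower = grid[cells[0]] - width / 2                # left end of the interval
upper = grid[cells[-1]] + width / 2               # right end of the interval
# For a unimodal density the selected cells form one unbroken interval.
contiguous = cells == list(range(cells[0], cells[-1] + 1))
```

Because the density is unimodal, the selected cells form a single interval whose two end points have nearly
equal density, as the isocline description requires. For a multimodal density the same procedure could return
several disjoint intervals.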

Another simple case occurs when the posterior density function is in some standard family of density
functions expressible in closed form. This is most commonly the family of Gaussian density functions. An
isocline of a Gaussian density function with mean $m$ and nonsingular covariance $A$ is a set of $x$ values
satisfying

$(x - m)^* A^{-1} (x - m) = c$  (11.1-7)

This is the equation of an ellipsoid.
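
As a hedged illustration of Equation (11.1-7), the sketch below evaluates the quadratic form for a
two-dimensional Gaussian and tests whether a point lies inside the ellipse for a chosen confidence level. The
mean, covariance, and test points are made-up values. The relation between $c$ and the confidence level used
here is the standard two-dimensional result (the quadratic form is chi-square distributed with two degrees of
freedom, so the enclosed probability is $1 - \exp(-c/2)$); it is not derived in this chapter and holds only
for two parameters.

```python
import math

def quadratic_form(x, m, A):
    """Return (x - m)* A^{-1} (x - m) for a 2-by-2 nonsingular covariance A."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    inv = [[A[1][1] / det, -A[0][1] / det],
           [-A[1][0] / det, A[0][0] / det]]
    d = [x[0] - m[0], x[1] - m[1]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))

def level_constant_2d(alpha):
    """Constant c for which a 2-D Gaussian puts probability alpha inside
    the ellipse (x - m)* A^{-1} (x - m) <= c. Valid for two dimensions only."""
    return -2.0 * math.log(1.0 - alpha)

# Hypothetical mean and covariance, chosen only for illustration.
m = [1.0, -0.5]
A = [[0.04, 0.03], [0.03, 0.09]]

c95 = level_constant_2d(0.95)
inside = quadratic_form([1.1, -0.3], m, A) <= c95    # a point near the mean
outside = quadratic_form([2.0, -0.5], m, A) <= c95   # a point far from the mean
```

For more than two parameters, the constant would come from the chi-square distribution with the matching
number of degrees of freedom, which has no equally simple closed form.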

For problems not fitting into one of these special cases, we usually must make approximations in the
computation of the confidence regions. Section 11.1.3 discusses the most common approximation.

11.1.2  Nonrandom  Parameter  Vector 

When $\xi$ is simply an unknown parameter with no random nature, the development of confidence regions is
more oblique, but the result is similar in form to the results of the previous section. The same comments
apply when we wish to ignore any prior distribution of $\xi$ and to obtain confidence regions based solely on
the current experimental data. These situations usually imply the use of MLE estimators.

In neither of these situations can we meaningfully discuss the probability of $\xi$ lying in a given region.
We proceed as follows to develop a substitute concept: the estimate $\hat{\xi}$ is a function of the observation $Z$,
which has a probability distribution conditioned on $\xi$. Therefore, we can define a probability distribution
of $\hat{\xi}$ conditioned on $\xi$. We will assume that this distribution has a density function $p_{\hat{\xi}|\xi}$.

For a given value of $\xi$, the isoclines of $p_{\hat{\xi}|\xi}$ define boundaries of confidence regions for $\hat{\xi}$. Let $R_1$
be such a confidence region, with confidence level $\alpha$.

$R_1 = \{x: p_{\hat{\xi}|\xi}(x|\xi) \ge c\}$  (11.1-8)

It is convenient to define $R_1$ in terms of the error density function $p_{e|\xi}$, using the relation

$p_{\hat{\xi}|\xi}(x|\xi) = p_{e|\xi}(\xi - x|\xi)$  (11.1-9)

This gives

$R_1 = \{\xi - x: p_{e|\xi}(x|\xi) \ge c\}$  (11.1-10)

The estimate $\hat{\xi}$ has probability $\alpha$ of being in $R_1$. For this chapter, we are more interested in the
situation where we know the value of $\hat{\xi}$ and seek to define a confidence region for $\xi$, which is unknown. We
can define such a confidence region for $\xi$, given $\hat{\xi}$, in two steps, starting with the region $R_1$.

The first step is to define a region $R_2$ which is a mirror image of $R_1$. A point $\xi - x$ in the region
$R_1$ reflects onto the point $\hat{\xi} + x$ in $R_2$, as shown in Figure (11.1-1). We can thus write $R_2$ as

$R_2 = \{\hat{\xi} + x: p_{e|\xi}(x|\xi) \ge c\}$  (11.1-11)

This reflection interchanges $\xi$ and $\hat{\xi}$; therefore, $\hat{\xi}$ is in $R_1$ if and only if $\xi$ is in $R_2$. Because there is
probability $\alpha$ that $\hat{\xi}$ lies in $R_1$, there is the same probability $\alpha$ that $\xi$ lies in $R_2$.


To be technically correct, we must be careful about the phrasing of this statement. Because the true value
$\xi$ is not random, it makes no sense to say that $\xi$ has probability $\alpha$ of lying in $R_2$. The randomness is in
the construction of the region $R_2$ because $R_2$ depends on the estimate $\hat{\xi}$, which depends in turn on the
noise-contaminated observations. We can sensibly say that the region $R_2$, constructed in this manner, has probability
$\alpha$ of covering the true value $\xi$. This concept of a region covering the fixed point $\xi$ replaces the concept of
the point $\xi$ lying in a fixed region. The distinction is more important in theory than in practice.
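
The idea of a randomly constructed region covering a fixed point can be illustrated by simulation. The sketch
below is only an illustration under simple assumptions: a scalar parameter with a fixed true value, an estimate
equal to the true value plus zero-mean Gaussian noise of known standard deviation, and an interval of fixed
half-width centered on each estimate. All names and numbers are hypothetical. The true value never changes; only
the interval moves from trial to trial.

```python
import random

def coverage_fraction(true_value, sigma, half_width, trials, seed=0):
    """Fraction of trials in which the interval centred on the estimate
    covers the fixed true value. The estimate is simulated as the true
    value plus zero-mean Gaussian noise with standard deviation sigma."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(trials):
        estimate = true_value + rng.gauss(0.0, sigma)
        if estimate - half_width <= true_value <= estimate + half_width:
            covered += 1
    return covered / trials

# A half-width of 1.96 standard deviations corresponds to a nominal
# confidence level of about 0.95 for a Gaussian error.
fraction = coverage_fraction(true_value=3.0, sigma=0.5,
                             half_width=1.96 * 0.5, trials=20000)
```

In this idealized case the error density does not depend on the true value, so the region can be constructed
without knowing the true value, and the observed coverage fraction should be close to the nominal level.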

Although we have defined the region $R_2$ in principle, we cannot construct the region from the data
available because $R_2$ depends on the value of $\xi$, which is unknown. Our next step is to construct a region $R_3$,
which approximates $R_2$, but does not depend on the true value of $\xi$. We base the approximation on the
assumption that $p_{e|\xi}$ is approximately invariant as a function of $\xi$; that is

$p_{e|\xi}(x|\xi) \approx p_{e|\xi}(x|\xi + \delta)$  (11.1-12)

This approximation is unlikely to be valid for large values of $\delta$ except in simple cases. For small values
of $\delta$, the approximation is usually reasonable.

We define the confidence region $R_3$ by applying this approximation to Equation (11.1-11), using $\hat{\xi} - \xi$
for $\delta$.

$R_3 = \{\hat{\xi} + x: p_{e|\xi}(x|\hat{\xi}) \ge c\}$  (11.1-13)

The region $R_3$ depends only on $\hat{\xi}$, $p_{e|\xi}$, and the arbitrary constant $c$. The function $p_{e|\xi}$ is presumed known
from the start, and $\hat{\xi}$ is the estimate computed by the methods described in previous chapters. In principle,
we have sufficient information to compute the region $R_3$. Practical application requires either that $p_{e|\xi}$
be in one of the simple forms described in Section 11.1.1, or that we make further approximations as discussed
in Section 11.1.3.

If $\hat{\xi} - \xi$ is small (that is, if the estimate is accurate), then $R_3$ will likely be a close approximation
to $R_2$. If $\hat{\xi} - \xi$ is large, then the approximation is questionable. The result is that we are unable to
define large confidence regions accurately except in special cases. We can tell that the confidence region is
large, but its precise size and shape are difficult to determine.

Note that the confidence region for nonrandom parameters, defined by Equation (11.1-13), is almost
identical in form to the confidence region for random parameters, defined by Equation (11.1-6). The only
difference in the form is what the density functions are conditioned on.

11.1.3  Gaussian  Approximation 

The previous sections have derived the boundaries of confidence regions for both random and nonrandom
parameter vectors in terms of isoclines of probability density functions of the error vector. Except in
special cases, the probability density functions are too complicated to allow practical computation of the
exact isoclines. Extreme precision in the computation of the confidence regions is seldom necessary; we have
already made approximations in the definition of confidence regions for nonrandom parameters. In this section,
we introduce approximations which allow relatively easy computation of confidence regions.

The central idea of this section is to approximate the pertinent probability density functions by Gaussian
density functions. As discussed in Section 11.1.1, the isoclines of Gaussian density functions are ellipsoids,
which are easy to compute. We call these "confidence ellipsoids" or "uncertainty ellipsoids." In many cases,
we can justify the Gaussian approximation with arguments that the distributions asymptotically approach
Gaussians as the amount of data increases. Section 5.4.2 discusses some pertinent asymptotic results.

A Gaussian approximation is defined by its mean and covariance. We will consider appropriate choices for
the mean and covariance to make the Gaussian density function a reasonable approximation. An obvious
possibility is to set the mean and covariance of the Gaussian approximation to match the mean and covariance of the
original density function; we are often forced to settle for approximations to the mean and covariance of the
original density function, the exact values being impractical to compute. Another possibility is to use
Equations (3.5-17) and (3.5-18). We will illustrate the use of both of these options.

Consider first the case of an MLE estimator. Equation (11.1-13) defines the confidence region. We will
use covariance matching to define the Gaussian approximation to $p_{e|\xi}$. The exact mean and covariance of
$p_{e|\xi}$ are difficult to compute, but there are asymptotic results which give reasonable approximations.

We use zero as an approximation to the mean of $p_{e|\xi}$; this approximation is based on MLE estimators being
asymptotically unbiased. Because MLE estimators are asymptotically efficient, the Cramer-Rao bound gives an asymptotic
approximation for the covariance of $p_{e|\xi}$ as the inverse of the Fisher information matrix $M(\hat{\xi})$. We can use
either Equation (4.2-19) or (4.2-24) as equivalent expressions for the Fisher information matrix.
Equation (5.4-11) gives the particular form of $M(\xi)$ for static nonlinear systems with additive Gaussian noise.

Both $\hat{\xi}$ and $M(\hat{\xi})$ are readily available in practical application. The estimate $\hat{\xi}$ is the primary output
of a parameter estimation program, and most MLE parameter-estimation programs compute $M(\hat{\xi})$ or an approximation
to it as a by-product of iterative minimization of the cost function.
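
A small sketch may make the computation concrete. Equation (5.4-11) is not reproduced in this chapter, so the
example below assumes the simplest instance of a static system with additive Gaussian noise: a scalar response
that is linear in two parameters, $z_i = \xi_1 + \xi_2 t_i + \text{noise}$, with independent noise of known
standard deviation. Under that assumption the Fisher information matrix is the sum of outer products of the
response sensitivities divided by the noise variance. The function names, sample times, and noise level are
all hypothetical.

```python
def fisher_information(times, sigma):
    """Fisher information matrix for z_i = xi_1 + xi_2 * t_i + noise,
    with independent Gaussian noise of standard deviation sigma."""
    M = [[0.0, 0.0], [0.0, 0.0]]
    for t in times:
        h = [1.0, t]   # sensitivity of the response to each parameter
        for i in range(2):
            for j in range(2):
                M[i][j] += h[i] * h[j] / sigma ** 2
    return M

def invert_2x2(M):
    """Inverse of a nonsingular 2-by-2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

# Hypothetical experiments: 50 samples, then the same 50 plus 50 later ones.
short_times = [0.1 * k for k in range(50)]
long_times = short_times + [5.0 + 0.1 * k for k in range(50)]

# Approximate error covariance A = M^{-1} for each experiment.
A_short = invert_2x2(fisher_information(short_times, sigma=0.2))
A_long = invert_2x2(fisher_information(long_times, sigma=0.2))
```

For this linear case the information matrix does not depend on the parameter values; in a nonlinear problem
the sensitivities would be evaluated at the estimate. The longer experiment gives smaller diagonal elements of
the approximate covariance, as expected when more data are available.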

Now consider the case of an MAP estimator. We need a Gaussian approximation to $p(e|Z)$.
Equations (3.5-17) and (3.5-18) provide a convenient basis for such an approximation. By Equation (3.5-17), we set
the mean of the Gaussian approximation equal to the point at which $p(e|Z)$ is a maximum; by definition of the
MAP estimator, this point is zero.

We then set the covariance of the Gaussian approximation to

$A = [-\nabla_e^2 \ln p(e|Z)]^{-1}$  (11.1-14)

evaluated at $e = 0$. For static nonlinear systems with additive Gaussian noise, Equation (11.1-14) reduces to
the form of Equation (5.4-12), which we could also have obtained by approximate covariance matching arguments.
This form for the covariance is the same as that used in the MLE confidence ellipsoid, with the addition of the
prior covariance term. As the prior covariance goes to infinity, the confidence ellipsoid for the MAP estimator
approaches that for the MLE estimator, as we would anticipate.
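
The limiting behavior can be checked with a scalar sketch. Equation (5.4-12) is not reproduced here; the
example assumes the usual form in which the inverse of the prior variance is added to the Fisher information
before inverting. The numerical values are arbitrary.

```python
def map_error_variance(fisher_info, prior_variance):
    """Scalar MAP error variance, assuming the form 1 / (M + 1/P), where M
    is the Fisher information and P is the prior variance."""
    return 1.0 / (fisher_info + 1.0 / prior_variance)

fisher_info = 25.0                  # hypothetical information from the data
mle_variance = 1.0 / fisher_info    # MLE approximation (no prior term)

# MAP variances for progressively weaker priors (larger prior variance).
prior_variances = (0.01, 1.0, 100.0, 1.0e6)
variances = [map_error_variance(fisher_info, p) for p in prior_variances]
```

Each MAP variance is smaller than the MLE variance, and the values increase toward the MLE variance as the
prior variance grows.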

Both the MLE and MAP confidence ellipsoids take the form

$(x - \hat{\xi})^* A^{-1} (x - \hat{\xi}) = c$  (11.1-15)

where $A$ is an approximation to the error-covariance matrix. We have suggested suitable approximations in the
above paragraphs, but most approximations to the error covariance are equally acceptable. The choice is
usually dictated by what is conveniently available in a given program.

11.1.4  Nonstatistical  Derivation

We can alternately derive the confidence ellipsoids for MAP and MLE estimators from a nonstatistical
viewpoint. This derivation obtains the same result as the statistical approach and is easier to follow.
Comparison of the ideas used in the statistical and nonstatistical derivations reveals the close relationships between
the statistical characteristics of the estimates and the numerical problems of computing them. The
nonstatistical approach generalizes easily to estimators and models for which precise statistical descriptions are
difficult.

The nonstatistical derivation presumes that the estimate is defined as the minimizing point of some cost
function. We examine the shape of this cost function as it affects the numerical minimization problem in the
area of the minimum. For current purposes, we are not concerned with start-up problems, isolated local minima,
and other problems manifested far from the solution point. A relatively flat, ill-defined minimum corresponds
to a questionable estimate; the extreme case of this is a function without a discrete local minimum point. A
steep, well-defined minimum corresponds to a reliable estimate.

With this justification, we define a confidence region to be the set of points with cost-function values
less than or equal to some constant. Different values of the constant give different confidence levels. The
boundary of such a region is an isocline of the cost function.

We then approximate the cost function in the neighborhood of the minimum by a quadratic Taylor-series
expansion about the minimum point.

$J(\xi) \approx J(\hat{\xi}) + \frac{1}{2} (\xi - \hat{\xi})^* [\nabla_\xi^2 J(\hat{\xi})] (\xi - \hat{\xi})$  (11.1-16)

The isoclines of this quadratic approximation are the confidence ellipsoids.

$(\xi - \hat{\xi})^* [\nabla_\xi^2 J(\hat{\xi})] (\xi - \hat{\xi}) = c$  (11.1-17)

The second gradient of an MLE or MAP cost function is an asymptotic approximation to the inverse of the appropriate error
covariance. Therefore, Equation (11.1-17) gives the same shape confidence ellipsoids as we previously derived
on a statistical basis. In practice, the Gauss-Newton or other approximation to the second gradient is
usually used.
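
The nonstatistical construction can be sketched in a few lines. The example below is an illustration under
stated assumptions, not the procedure of any particular program: it uses a made-up two-parameter cost function
with a known minimum, approximates the second gradient by central differences (a real program would more
likely use the Gauss-Newton approximation), and tests points against the ellipsoid of Equation (11.1-17). The
function names, cost function, and step size are hypothetical.

```python
def second_gradient(cost, point, step=1e-3):
    """Central-difference approximation to the second-gradient matrix of
    `cost` evaluated at `point`."""
    n = len(point)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                p = list(point)
                p[i] += di * step
                p[j] += dj * step
                return cost(p)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * step ** 2)
    return H

def inside_ellipsoid(xi, xi_hat, H, c=1.0):
    """True if (xi - xi_hat)* H (xi - xi_hat) <= c."""
    d = [a - b for a, b in zip(xi, xi_hat)]
    n = len(d)
    return sum(d[i] * H[i][j] * d[j] for i in range(n) for j in range(n)) <= c

def cost(xi):
    """Hypothetical cost function with its minimum at (1, -2)."""
    a, b = xi[0] - 1.0, xi[1] + 2.0
    return 2.0 * a * a + 1.5 * a * b + 3.0 * b * b + 0.1 * a ** 4

xi_hat = [1.0, -2.0]                  # minimizing point of the cost function
H = second_gradient(cost, xi_hat)     # approximately [[4, 1.5], [1.5, 6]]
near = inside_ellipsoid([1.1, -2.0], xi_hat, H)
far = inside_ellipsoid([2.0, -2.0], xi_hat, H)
```

The quartic term in the cost function does not affect the second gradient at the minimum, which illustrates
that the ellipsoid describes only the local shape of the cost function.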

The constant $c$ determines the size of the confidence ellipsoid. The nonstatistical derivation gives no
obvious basis for selecting a value of $c$. The value $c = 1$ gives the most useful correspondence to the
statistical derivation, as we will see in Section 11.2.1.

Figures (11.1-2) and (11.1-3) illustrate the construction of one-dimensional confidence ellipsoids using
the nonstatistical definition.


11.2  ANALYSIS  OF  THE  CONFIDENCE  ELLIPSOID 

The confidence ellipsoid gives a comprehensive picture of the theoretically likely errors in the estimate.
It is difficult, however, to display the information content of the ellipsoid on a two-dimensional sheet of
paper. In the applications we most commonly work on, there are typically 10 to 30 unknown parameters; that is,
the ellipsoid is 10- to 30-dimensional. We can print the covariance matrix which defines the shape of the
ellipsoid, but it is difficult to draw useful conclusions from such a presentation format. The problem of
meaningful presentation is further compounded when analyzing hundreds of experiments to obtain parameter
estimates under a wide variety of conditions.

In the following sections, we discuss simplified statistics that characterize important features of the
confidence ellipsoids in ways that are easy to describe and present. The emphasis in these statistics is on
reducing the dimensionality of the problem. Many important questions about accuracy reduce to one-dimensional
forms, such as the accuracy of the estimate of each element of the parameter vector.

All of the statistics discussed here are functions of the matrix $A$, which defines the shape of the
confidence ellipsoid. We have seen above that $A$ is an approximation to the error-covariance matrix. These two
viewpoints of $A$ will provide us with geometrical and statistical interpretations. A third interpretation
comes from viewing $A$ as the inverse of the second gradient of the cost function. In practice, $A$ is usually
computed from the Gauss-Newton or other convenient approximation to the second gradient.

These statistics are closely linked to some of the basic sources of estimation errors and difficulties.
We will illustrate the discussion with idealized examples of these classes of difficulties. The exact means
of overcoming such difficulties depend on the problem, but the first step is to understand the mechanism
causing the difficulty. In a surprising number of applications, the major difficulties are cases of the simple
idealizations discussed here.


11.2.1  Sensitivity


The sensitivity is the simplest of the statistics relating to the confidence ellipsoid. Although the
sensitivity has both a statistical and a nonstatistical interpretation, the use of the statistical interpretation
is relatively rare. The term "sensitivity" comes from the nonstatistical interpretation, which we will discuss
first.


From the nonstatistical viewpoint, the sensitivity is a measure of how much the cost-function value
changes for a given change in a scalar parameter value. The most common definition of the sensitivity with
respect to a parameter is the second partial derivative of the cost function with respect to the parameter.

$S_i = \frac{\partial^2 J(\xi)}{\partial \xi_i^2}$  (11.2-1)

For the purposes of this chapter, we are interested in the sensitivity evaluated at the minimum point of the
cost function; we will take this as part of the definition of the sensitivity.

The $\xi_i$ in Equation (11.2-1) can be any scalar function of the $\xi$ vector. In most cases, $\xi_i$ is one of
the elements of the $\xi$ vector. For simplicity, we will assume for the rest of this section that $\xi_i$ is the
ith element of $\xi$. Generalizations are straightforward. When $\xi_i$ is the ith element of $\xi$, the second
partial derivative with respect to $\xi_i$ is the ith diagonal element of the second-gradient matrix.

$S_i = [\nabla_\xi^2 J(\hat{\xi})]_{ii} = (A^{-1})_{ii}$  (11.2-2)

The sensitivity has a simple geometric interpretation based on the confidence ellipsoid. Use the value
$c = 1$ in Equation (11.1-17) to define a confidence ellipsoid. Draw a line passing through $\hat{\xi}$ (the center of
the ellipsoid) and parallel to the $\xi_i$ axis. The sensitivity with respect to $\xi_i$ is related to the distance,
$I_i$, from the center of the ellipsoid to the intercept of this line and the ellipsoid. We call this distance
the insensitivity with respect to $\xi_i$. Figure (11.2-1) shows the construction of the insensitivities with
respect to $\xi_1$ and $\xi_2$ on a two-dimensional example. The relationship between the sensitivity and the
insensitivity is

$I_i = (S_i)^{-1/2} = [(A^{-1})_{ii}]^{-1/2}$  (11.2-3)

which follows immediately from Equation (11.1-17) for the confidence ellipsoid, and Equation (11.2-1) for the
sensitivity.

We can rephrase the geometric interpretation of the insensitivity as follows: the insensitivity with
respect to $\xi_i$ is the largest change that we can make in the ith element of $\hat{\xi}$ and still remain within the
confidence ellipsoid. All other elements of $\xi$ are constrained to remain equal to their estimated values
during this search; that is, the search is constrained to a line parallel to the $\xi_i$ axis passing through $\hat{\xi}$.

From the statistical viewpoint, the insensitivity with respect to $\xi_i$ is an approximation to the standard
deviation of $e_i$, the corresponding component of the error, conditioned on all of the other components of the
error. We can see this by recalling the results from Chapter 3 on conditional Gaussian distributions. If the
covariance of $e$ is $A$, then the covariance of $e_i$ conditioned on all of the other components is
$[(A^{-1})_{ii}]^{-1}$; therefore, the conditional standard deviation is $[(A^{-1})_{ii}]^{-1/2}$. From Equations (11.2-2)
and (11.2-3), we can see that this expression equals the insensitivity. Note that the conditioning on the
other elements in the statistical viewpoint corresponds directly to the constraint on the other elements in the
geometric viewpoint.

A sensitivity analysis will detect one of the most obvious kinds of estimation difficulty: parameters
which have little or no effect on the system response. If a parameter has no effect on the system response,
then it should be obvious that the system response data give no basis for an estimate of the parameter; in
statistical terms, the system is unidentifiable. Similarly, if a parameter has little effect on the system
response, then there is little basis for an estimate of the parameter; we can expect the estimates to be
inaccurate.

Checking for parameters which have no effect on the system response may seem like an academic exercise,
considering that practical problems would not be likely to have such irrelevant parameters. In fact, this
seemingly trivial difficulty is extremely common in practical applications. It can arise from typographical
or other errors in input to computer programs. Perhaps the most common example of this problem is attempting
to estimate the effect of an input which is identically zero. The input might either be validly zero, in which
case its effect cannot be estimated, or the input signal might have been destroyed or misplaced by sensor or
programming problems.
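
A check for this condition is easy to automate. The sketch below assumes a hypothetical static model
$z_k = a x_k + b u_k + \text{noise}$ in which $a$ and $b$ are the unknown parameters, and it forms the
Gauss-Newton approximation to the second gradient as the sum of outer products of the response sensitivities
divided by the noise variance. When the input $u$ is identically zero, the diagonal element for $b$ is zero,
so the sensitivity with respect to $b$ vanishes. The signal shapes, names, and tolerance are illustrative
assumptions.

```python
import math

def gauss_newton_second_gradient(states, inputs, sigma):
    """Gauss-Newton second gradient for z_k = a * x_k + b * u_k + noise,
    with respect to the parameters (a, b)."""
    H = [[0.0, 0.0], [0.0, 0.0]]
    for state_k, input_k in zip(states, inputs):
        h = [state_k, input_k]   # response sensitivities to a and b
        for i in range(2):
            for j in range(2):
                H[i][j] += h[i] * h[j] / sigma ** 2
    return H

def irrelevant_parameters(H, tolerance=1e-12):
    """Indices of parameters whose sensitivity (diagonal element) is zero
    to within the given tolerance."""
    return [i for i in range(len(H)) if H[i][i] <= tolerance]

states = [math.sin(0.2 * k) for k in range(100)]
zero_input = [0.0] * 100                                     # lost or unused input
pulse_input = [1.0 if 20 <= k < 40 else 0.0 for k in range(100)]

flagged_zero = irrelevant_parameters(
    gauss_newton_second_gradient(states, zero_input, sigma=0.1))
flagged_pulse = irrelevant_parameters(
    gauss_newton_second_gradient(states, pulse_input, sigma=0.1))
```

With the zero input, the parameter with index 1 (the input effectiveness $b$) is flagged; with the pulse
input, nothing is flagged. In practice the tolerance would need to be scaled to the problem.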

The sensitivity is a reasonable indicator of accuracy only when we are estimating a single parameter,
because the estimates of other parameters are never exact, as the sensitivity analysis assumes. The
sensitivity analysis ignores all effects of correlation between parameters; we can evaluate the sensitivity with
respect to a parameter without even knowing what other parameters are being estimated. When more than one
parameter is estimated, the insensitivity gives only a lower bound for the error band. The error band is
always at least as large as the insensitivity regardless of what other parameters are estimated; correlation
effects between parameters can increase, but never decrease, the error band. In other words, high sensitivity
is a necessary, but not sufficient, condition for an accurate estimate.

In practice, correlation effects tend to increase the error band so much that the sensitivity is virtually
useless as an indicator of accuracy. The sensitivity analysis is usually useful only for detecting the problem
of completely irrelevant parameters. The sensitivity will not indicate when the effect of a parameter is
indistinguishable from the effects of other parameters, a more common problem.

11.2.2  Correlation 

We noted in the previous section that correlations among parameters result in much larger error bands than
indicated by the sensitivities alone. The inadequacy of the sensitivity as a measure of estimate accuracy has
led to the widespread use of the statistical correlations to indicate accuracy. We will see in this section
that the correlations also give an incomplete picture of the accuracy.

The statistical correlation between two error components e_i and e_j is defined to be

    \mathrm{corr}(e_i, e_j) = \frac{E\{e_i e_j\}}{\sqrt{E\{e_i^2\} E\{e_j^2\}}}

assuming that the means of e_i and e_j are zero. In terms of A, the covariance matrix of e, the correlation
is

    \mathrm{corr}(e_i, e_j) = \frac{A_{ij}}{\sqrt{A_{ii} A_{jj}}}                                  (11.2-5)

Geometrically, the correlations are related to the eccentricity of the confidence ellipsoid. If the
sensitivities with respect to all of the unknown parameters are equal (which we can always arrange by a scale
change), and if the correlations are all zero, then the confidence ellipsoid is spherical. As the magnitudes
of the correlations become larger, the eccentricity of the scaled ellipsoid increases. The magnitude of the
correlations can never exceed 1, except through approximations or round-off errors in the computation.

The definition above is for the unconditional, or full, correlations. Whenever the term correlation
appears without a modifier, it implicitly means the unconditional correlation. We can also define conditional
correlations, although they are less commonly used. The definition of the conditional correlation is identical
to that of the unconditional correlation, except that the expected values are all conditioned on all of the
parameters other than the two under consideration. We can express the conditional correlation of e_i and e_j
as

    \mathrm{cond\ corr}(e_i, e_j) = -\frac{\Gamma_{ij}}{\sqrt{\Gamma_{ii} \Gamma_{jj}}}             (11.2-6)

where Γ = A^{-1}. This is similar to the expression for the unconditional correlation, the difference being that
Γ replaces A and the sign is changed.

If there are only two unknowns, the conditional and unconditional correlations are identical. If there
are more than two unknowns, the conditional and unconditional correlations can give quite different pictures.
Consider the case in which Γ = A^{-1} is an N-by-N matrix with 1's on the diagonal and with all of the
off-diagonal elements equal to λ, so that every conditional correlation is -λ. As λ approaches -1/(N - 1), the
conditional correlations approach 1/(N - 1) and the full correlations approach 1. In the limit, when λ equals
-1/(N - 1), the Γ matrix is singular. Thus, for large N, the full correlations can be quite high even when all
of the conditional correlations are low. This same example inverts to show that the converse also is true.
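
The equal-off-diagonal case can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `correlations`, the choice N = 20, and the factor 0.999 are illustrative assumptions, not values from the text.

```python
import numpy as np

def correlations(cov):
    """Return (full, conditional) correlation matrices for a covariance matrix."""
    info = np.linalg.inv(cov)                      # inverse of the covariance matrix
    d_cov = np.sqrt(np.diag(cov))
    d_info = np.sqrt(np.diag(info))
    full = cov / np.outer(d_cov, d_cov)            # form of Equation (11.2-5)
    cond = -info / np.outer(d_info, d_info)        # form of Equation (11.2-6)
    return full, cond

n = 20
lam = -0.999 / (n - 1)            # off-diagonal value close to the singular limit
info = np.full((n, n), lam)       # inverse covariance: 1's on the diagonal,
np.fill_diagonal(info, 1.0)       # lam everywhere else
cov = np.linalg.inv(info)

full, cond = correlations(cov)
full_01 = full[0, 1]              # only off-diagonal entries are meaningful
cond_01 = cond[0, 1]
print(f"conditional correlation: {cond_01:.3f}")
print(f"full correlation:        {full_01:.3f}")
```

With these assumed values the conditional correlation stays near 1/(N - 1), while the full correlation is close to 1.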

There are three objections to using the correlations, full or conditional, as primary indicators of
accuracy. First, although the correlations give information about the shape of the confidence ellipsoid, they
completely ignore its size. Figure (11.2-2) shows two confidence ellipsoids. Ellipse A is completely
contained within ellipse B and is, therefore, clearly preferable; yet ellipse B has zero correlation and
ellipse A has significant correlation. From this example, it is obvious that accurate estimates can have
high correlations and poor estimates can have low correlations. To evaluate the accuracy of the estimates,
you need information about the sensitivities as well as about the correlations; neither alone is adequate.

As a more concrete example of the interplay between correlation and sensitivity, consider a scalar linear
system:

    z(t_i) = D u(t_i) + H                                                                           (11.2-7)

We wish to estimate D. Both D and the bias H are unknown. The input u(t_i) is an angular position of
some control device. Suppose that the input time-history is as shown in Figure (11.2-3). A large portion of
the energy in this input is from the steady-state value of 90°; the energy in the pulse is much smaller. This
input is highly correlated with a constant bias input. Therefore, the estimate of D will be highly
correlated with the estimate of H. (If this point is not obvious, we can choose a few time points on the figure
and compute the corresponding covariance matrix.) The sensitivity with respect to D is high; because of the
large values of u, small changes in D cause large changes in z.

Now we consider the same system, with the input shown in Figure (11.2-4). Both the correlation and the
sensitivity are much lower than they were for the input of Figure (11.2-3). These changes balance each other,
resulting in the same accuracy in estimating D. The inputs shown in the two figures are identical, but
measured with respect to reference axes rotated by 90°. The choice of reference axis is a matter of convention
which should not affect the accuracy; it does, however, affect both the sensitivity and correlation.

This example illustrates that the correlation alone is not a reasonable measure of accuracy. By
redefining the reference axis of the input in this example, we can change the correlation at will to any value
between -1 and 1.
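
The balance between correlation and sensitivity can be seen by computing the covariance matrix at a few time points, as suggested above. The sketch below assumes a hypothetical ten-sample pulse of 10° (the actual time histories of Figures (11.2-3) and (11.2-4) are not reproduced here) and unit measurement-noise variance; the function name `accuracy_measures` is likewise an assumption.

```python
import math

def accuracy_measures(u, sigma=1.0):
    """Accuracy statistics for z = D*u + H with measurement-noise std sigma."""
    n = len(u)
    su = sum(u)
    suu = sum(x * x for x in u)
    # Inverse covariance (information) matrix for the unknowns (D, H).
    g_dd, g_dh, g_hh = suu / sigma**2, su / sigma**2, n / sigma**2
    det = g_dd * g_hh - g_dh**2
    # Covariance matrix A: inverse of the 2-by-2 information matrix.
    a_dd, a_dh, a_hh = g_hh / det, -g_dh / det, g_dd / det
    return {
        "insensitivity_D": 1.0 / math.sqrt(g_dd),   # small value = high sensitivity
        "correlation_DH": a_dh / math.sqrt(a_dd * a_hh),
        "cramer_rao_D": math.sqrt(a_dd),
    }

pulse = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 0.0, 0.0, 0.0, 0.0]   # deg, assumed shape
offset = [x + 90.0 for x in pulse]                               # same pulse about 90 deg

about_90 = accuracy_measures(offset)
about_0 = accuracy_measures(pulse)
for name, stats in (("about 90 deg", about_90), ("about 0 deg", about_0)):
    print(name, {k: round(v, 4) for k, v in stats.items()})
```

Under these assumptions the correlation magnitude and the insensitivity differ greatly between the two reference axes, while the standard deviation of the D estimate is the same in both cases.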

The second objection to the use of correlations as indicators of accuracy is more serious because it
cannot be answered by simply looking at sensitivities and correlations together. In the same way that
sensitivities are one-dimensional tools, correlations are two-dimensional tools. The utility of a tool
restricted to two-dimensional subspaces is limited. Three simple examples of idealized but realistic situations
serve to illustrate the dimensional limitations of the correlations. These examples involve free
lateral-directional oscillation of an aircraft.

For the first example, there is a yaw-rate feedback to the rudder and a rudder-to-aileron interconnect.
Thus the aileron and rudder signals are both proportional to yaw rate. In this case, the conditional
correlations of the aileron, rudder, and yaw-rate derivatives are 1 (or nearly so with imperfect data).
Conditioned on the aileron derivatives being known exactly, changes in the rudder derivative estimates can be
exactly compensated for by changes in the yaw-rate derivative estimates; thus, the conditional correlation is 1.
The unconditional correlations, however, are easily seen to be only 1/2. Changes in the rudder derivative
estimates must be compensated for by some combination of changes in the aileron and yaw-rate derivative
estimates. Since there are no constraints on how much of the compensation must come from the aileron and how
much from the yaw-rate derivative estimates, the unconditional correlations would be 1/2 (because, on the
average, 1/2 of the compensation would come from each source).

For the second example, no feedback is present and there is a neutrally damped dutch-roll oscillation
(or a wing rock). The sideslip, roll-rate, and yaw-rate signals are thus all sinusoids of the same frequency,
with different phases and amplitudes. Taken two at a time, these signals have low correlations. The
conditional correlations consider only two parameters at a time, and thus the conditional correlations of the
derivatives will be low. Nonetheless, the three signals are linearly dependent when all are considered
together, because they can all be written as linear combinations of a sine wave and a cosine wave at the
dutch-roll frequency. The unconditional correlations of the derivatives will be 1 (or nearly so with imperfect
data).

Both of the above examples have three-dimensional correlation problems, which prevent the parameters from
being identifiable. The conditional correlations are low in one case, and the unconditional correlations are
low in the other. Although neither alone is sufficient, examination of both the conditional and unconditional
correlations will always reveal three-dimensional correlation problems.
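
The second example can be illustrated with three sampled sinusoids. This is a rough sketch, assuming NumPy, unit-amplitude signals with hypothetical phase angles of 0°, 60°, and 120°, unit measurement noise, and a small diagonal term added as a stand-in for imperfect data; none of these values come from the text.

```python
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)   # one full period
phases = np.radians([0.0, 60.0, 120.0])                   # assumed phase angles
signals = np.column_stack([np.sin(t + p) for p in phases])

# Inverse covariance of the derivatives multiplying these signals,
# plus a small diagonal term standing in for imperfect data.
info = signals.T @ signals + 1e-3 * np.eye(3)
cov = np.linalg.inv(info)

def normalize(m):
    """Scale a symmetric matrix to unity diagonal elements."""
    d = np.sqrt(np.diag(m))
    return m / np.outer(d, d)

cond_corr = -normalize(info)     # conditional correlations
full_corr = normalize(cov)       # unconditional correlations

off_diag = ~np.eye(3, dtype=bool)
max_cond = np.abs(cond_corr[off_diag]).max()
min_full = np.abs(full_corr[off_diag]).min()
print(f"largest conditional correlation magnitude: {max_cond:.3f}")
print(f"smallest full correlation magnitude:       {min_full:.5f}")
```

For these assumed phases the pairwise (conditional) correlations are about 0.5 in magnitude, yet every unconditional correlation is essentially 1 in magnitude, because the three signals span only two dimensions.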

For the third example, suppose that a wing leveler feeds back bank angle to the aileron, and that a
neutrally damped dutch roll is present with the feedback on. There are then four pertinent signals (sideslip,
roll rate, yaw rate, and aileron) that are sinusoids with the same frequency and different phases. In this
case, both the conditional and the unconditional correlations will be low. Nonetheless, there is a correlation
problem which results in unidentifiable parameters. This correlation problem is four-dimensional and cannot
be seen using the two-dimensional correlations.

The full and conditional correlations are closely related to the eigenvalues of 2-by-2 submatrices of the
A and Γ matrices, respectively, normalized to have unity diagonal elements. Specifically, the eigenvalues are
1 plus the correlation and 1 minus the correlation; thus, high correlations correspond to large eigenvalue
spreads. Higher-order correlations would be investigated using eigenvalues of larger submatrices. Looked at
in this light, the investigation of 2-by-2 submatrices is revealed as an arbitrary choice dictated by its
familiarity more than by any objective criterion. The eigenvalues of the full normalized A and Γ matrices
would seem more appropriate tools. These eigenvalues and the corresponding eigenvectors can provide some
information, but they are seldom used. In principle, small eigenvalues of the normalized Γ matrix or large
eigenvalues of the normalized A matrix indicate correlations among the parameters with significant components
in the corresponding eigenvectors. Note that the eigenvalues of the unnormalized Γ and A matrices are of
little use in studying correlations, because scaling effects tend to dominate.
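
As an illustration of the eigenvalue approach, the sketch below applies it to a four-signal case of the kind described in the third example. It assumes NumPy, unit-amplitude sinusoids with hypothetical phases of 0°, 45°, 90°, and 135°, unit measurement noise, and a small diagonal term representing imperfect data.

```python
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
phases = np.radians([0.0, 45.0, 90.0, 135.0])             # assumed phase angles
signals = np.column_stack([np.sin(t + p) for p in phases])
info = signals.T @ signals + 1e-3 * np.eye(4)             # small term: imperfect data

def normalize(m):
    """Scale a symmetric matrix to unity diagonal elements."""
    d = np.sqrt(np.diag(m))
    return m / np.outer(d, d)

info_n = normalize(info)                      # normalized inverse covariance
cov_n = normalize(np.linalg.inv(info))        # normalized covariance

off_diag = ~np.eye(4, dtype=bool)
max_cond = np.abs(info_n[off_diag]).max()     # conditional correlation magnitudes
max_full = np.abs(cov_n[off_diag]).max()      # unconditional correlation magnitudes
eigenvalues = np.linalg.eigvalsh(info_n)      # returned in ascending order

print(f"largest conditional correlation magnitude: {max_cond:.3f}")
print(f"largest full correlation magnitude:        {max_full:.3f}")
print("eigenvalues of normalized inverse covariance:", np.round(eigenvalues, 5))
```

In this assumed case no pairwise correlation of either kind approaches 1, but two eigenvalues of the normalized inverse covariance matrix are nearly zero, which exposes the higher-dimensional dependence.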

The last objection to the use of the correlations is the difficulty of presentation. It is impractical to
display the estimated correlations graphically in a problem with more than a handful of unknowns. The most
common presentation is simply to print the matrix of estimated correlations. This option offers little
improvement in comprehensibility over simply printing the A matrix. If there are a large number of
experiments, it is pointless to print all of the correlation matrices. Such a nongraphical presentation cannot
reasonably give a coherent picture of the system analyzed.

11.2.3  Cramer-Rao  Bound 

The Cramer-Rao bound is the last of the statistics based on the confidence ellipsoid. It proves to be the
most useful of these statistics. The Cramer-Rao bound is often referred to by other names, including the
standard deviation and the uncertainty level. We will consider both statistical and nonstatistical
interpretations of the Cramer-Rao bound.

The Cramer-Rao bound of an estimated scalar parameter is the standard deviation of the error in that
parameter. Strictly speaking, the term Cramer-Rao bound applies only to the approximation to the standard
deviation obtained from the Cramer-Rao inequality. For the purposes of this section, the properties are
similar, regardless of the source of the standard deviation. In terms of the A matrix, the Cramer-Rao bound of
the ith element of e is (A_{ii})^{1/2}.

The Cramer-Rao bound is closely related to the insensitivity. Both are standard deviations of the error,
the only difference being that the insensitivity is the conditional standard deviation, whereas the Cramer-Rao
bound is unconditional. They are also computationally similar, the difference being in whether the inversion
is of the matrix or of the individual element.

The geometric relationship between the Cramer-Rao bound and the insensitivity is particularly revealing.
The Cramer-Rao bound on e_i is the largest change that you can make in e_i and still remain within the
confidence ellipsoid. During this search, the other components are free to take any values that keep the point
within the confidence ellipsoid. This definition is identical to the geometric definition of the
insensitivity, except that the other components are constrained to the estimated values in the definition of
insensitivity. This constraint is directly related to the statistical conditioning in the definition of the
insensitivity; the Cramer-Rao bound has no such constraints and is an unconditional standard deviation.

The Cramer-Rao bound must always be at least as large as the insensitivity, because releasing a constraint
can never make the solution of a maximization problem smaller. This fact relates to our previous statement
that correlation effects can increase, but never decrease, the error band defined by the insensitivity.

Figure (11.2-5) illustrates the geometric interpretation of the Cramer-Rao bounds and insensitivities in a
two-dimensional example.
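
A two-parameter numerical case makes the relationship concrete. The covariance values below are hypothetical, chosen only for illustration; the sketch computes the Cramer-Rao bound from the diagonal of the covariance matrix A and the insensitivity from the diagonal of its inverse.

```python
import math

# Hypothetical 2-by-2 covariance matrix A of the estimation error.
a11, a12, a22 = 4.0, 3.0, 9.0
det = a11 * a22 - a12 * a12

# Diagonal of the inverse covariance matrix (inversion of the whole matrix).
g11, g22 = a22 / det, a11 / det

cramer_rao = (math.sqrt(a11), math.sqrt(a22))                  # unconditional std
insensitivity = (1.0 / math.sqrt(g11), 1.0 / math.sqrt(g22))   # conditional std
correlation = a12 / math.sqrt(a11 * a22)

for i in range(2):
    ratio = insensitivity[i] / cramer_rao[i]
    print(f"parameter {i + 1}: insensitivity {insensitivity[i]:.3f}, "
          f"Cramer-Rao bound {cramer_rao[i]:.3f}, ratio {ratio:.3f}")
print(f"sqrt(1 - correlation^2) = {math.sqrt(1.0 - correlation**2):.3f}")
```

In the two-dimensional case the ratio of insensitivity to Cramer-Rao bound equals the square root of 1 minus the squared correlation, so the two measures coincide only when the correlation is zero.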

To prove that the Cramer-Rao bound is the solution to the above optimization problem, we will state and
prove a more general result. (The general result is actually easier to prove.)

Theorem 11.2-1  Given a fixed vector y and a positive definite symmetric
matrix H, the maximum of x*y, subject to the constraint that x*Hx ≤ 1, is
given by \sqrt{y^* H^{-1} y}.

Proof  Since x*y has no unconstrained local extrema, the solution must
lie on the constraint boundary; therefore, the inequality in the constraint
can be replaced by an equality. This constrained optimization problem can
be restated by the use of Lagrange multipliers (Luenberger, 1969) as finding
the stationary points of

    f(x, \lambda) = x^* y - \tfrac{1}{2} \lambda (x^* H x - 1)                        (11.2-8)

where λ is the scalar Lagrange multiplier. The maximum is found by setting
the gradients to zero as follows:

    0 = \nabla_x f(x, \lambda) = y - \lambda H x                                       (11.2-9)

    0 = \nabla_\lambda f(x, \lambda) = -\tfrac{1}{2} (x^* H x - 1)                     (11.2-10)

From Equation (11.2-9) we have

    x = \lambda^{-1} H^{-1} y                                                          (11.2-11)

Substituting this into Equation (11.2-10) gives

    y^* H^{-1} \lambda^{-1} H \lambda^{-1} H^{-1} y - 1 = 0                            (11.2-12)

or

    \lambda^{-2} y^* H^{-1} y = 1                                                      (11.2-13)

or

    \lambda = \sqrt{y^* H^{-1} y}                                                      (11.2-14)

Substituting into Equation (11.2-11) gives

    x = \frac{H^{-1} y}{\sqrt{y^* H^{-1} y}}                                           (11.2-15)

and thus

    x^* y = \frac{y^* H^{-1} y}{\sqrt{y^* H^{-1} y}} = \sqrt{y^* H^{-1} y}

at the solution. This is the result sought.

Taking H = A^{-1}, the specific case of y being a unit vector along the ith axis gives the form claimed for
the Cramer-Rao bound of the ith element.

The general form of Theorem (11.2-1) has other applications. The value of any linear combination of the
parameters can be expressed as ξ*y for some fixed y-vector. Thus the general form shows how to evaluate
the accuracy of arbitrary linear combinations of parameters. This form applies to many situations where the
sum, difference, or other combination of multiple parameters is of interest.
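
The theorem is easy to spot-check numerically. The sketch below assumes NumPy and a hypothetical 3-by-3 covariance matrix; the function name `max_linear_form` is an illustrative choice. It evaluates the closed-form maximum for a single parameter and for the difference of two parameters, and compares the latter against randomly sampled points on the ellipsoid boundary.

```python
import numpy as np

def max_linear_form(h, y):
    """Maximum of x.y subject to x.H.x <= 1, and the maximizing x."""
    h_inv_y = np.linalg.solve(h, y)
    bound = np.sqrt(y @ h_inv_y)
    return bound, h_inv_y / bound

cov = np.array([[4.0, 3.0, 0.0],
                [3.0, 9.0, 1.0],
                [0.0, 1.0, 1.0]])       # hypothetical covariance matrix A
h = np.linalg.inv(cov)                  # ellipsoid x.H.x <= 1 with H = inverse of A

# Unit vector along the first axis: recovers sqrt(A_11).
bound_1, x_1 = max_linear_form(h, np.array([1.0, 0.0, 0.0]))

# Difference of the first two parameters.
y_diff = np.array([1.0, -1.0, 0.0])
bound_diff, x_diff = max_linear_form(h, y_diff)

# Spot check: random points on the ellipsoid boundary never exceed the bound.
rng = np.random.default_rng(0)
samples = rng.standard_normal((1000, 3))
scale = np.sqrt(np.einsum("ij,jk,ik->i", samples, h, samples))
samples /= scale[:, None]
largest_sampled = (samples @ y_diff).max()

print(f"bound on parameter 1:  {bound_1:.4f}")
print(f"bound on difference:   {bound_diff:.4f}")
print(f"largest sampled value: {largest_sampled:.4f}")
```

For the difference of two parameters the bound reduces to the familiar square root of A_11 + A_22 - 2 A_12, which shows how a strong positive correlation can make a difference more accurate than either parameter alone.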

On the basis of this geometric picture, we can think of the Cramer-Rao bounds as insensitivities that are
computed accounting for all parameter correlations. The computation and interpretation of the Cramer-Rao
bounds are valid in any number of dimensions. In this respect, the Cramer-Rao bounds contrast with the
insensitivities, which are one-dimensional tools, and the correlations, which are two-dimensional tools. The
Cramer-Rao bounds are thus the best of the theoretical measures of accuracy that can be evaluated for a single
experiment.


11.3  OTHER  MEASURES  OF  ACCURACY 

The previous sections have discussed the Cramer-Rao bound and other accuracy statistics based on the
confidence ellipsoid. Although the Cramer-Rao bound is the best single analytical measure of accuracy,
overreliance on any single source of accuracy data is dangerous. Uncritical use of the Cramer-Rao bound can
give extremely misleading results in realistic situations, as discussed by Maine and Iliff (1981b). This
section discusses alternate accuracy measures, which can supplement the Cramer-Rao bound.

11.3.1  Bias 

The bias of an estimator is occasionally cited as an indicator of accuracy. We do not consider it a
useful indicator in most circumstances. This section is limited to a brief exposition of the reasons for this
judgment.


Section 4.2.1 defines the bias of an estimator. Bias arises from several sources. Some estimators are
intrinsically biased, regardless of the nature of the data. Random noise in the data often causes a bias.
The bias from random noise sometimes goes to zero asymptotically for estimators matched to the noise
characteristics. Finally, the inevitable modeling errors in analyzing real systems cause all estimators to be
biased, even asymptotically. Most discussions of bias refer, implicitly or explicitly, to asymptotic bias.
Even for idealized cases with no modeling error, estimators are seldom unbiased for finite time.

There are two reasons why the bias is of minimal use as a measure of accuracy. First, the bias reflects
only the consistent errors; it ignores random scatter. As illustrated in Section 4.2.1, it is possible for an
estimator to give ludicrous individual estimates which average out to a small or zero bias. This property is
intrinsic to the definition of the bias.

Second, the bias is difficult to compute in most cases. If we could compute the bias, we could subtract
it from the estimates to obtain revised estimates that were unbiased. (Some estimators use this technique.)

In some cases, it may be practical to compute a bound on the magnitude of the bias from a particular
source, even when we cannot compute the actual bias. Although they are rarely used, such bounds can give a
reasonable indication of the likely magnitude of the error from some sources. This is the most constructive
use of bias information in evaluating accuracy.

In contrast, the often-repeated statements that a given estimator is or is not asymptotically unbiased
are of little practical use. Most of the estimators considered in this document are asymptotically unbiased
when the assumptions used in the derivation are true. The statement that other estimators are biased under the
same conditions amounts to a restatement of the universal principle that estimators are biased in the presence
of modeling error. Thus arguments about which of two estimators is biased are pointless. These arguments
reduce to the issue of what assumptions to use, an issue best addressed directly.

Although quantitative measures of bias may not be available, the analyst should always consider the issue
of bias due to modeling error. Bias errors are added to all other types of error in the estimates.
Unfortunately, some bias errors are impossible to detect solely by analyzing the data. The estimates can be
repeatable with little scatter and appear to be accurate by all other measures, and still have large bias
errors. An example of this type of problem is a calibration error in a nonredundant instrument. The only way
to avoid such problems is to be meticulous in executing and documenting every step of the application,
including modeling, instrumentation, and data handling. No automatic tests exist that adequately substitute for
such care.

11.3.2  Scatter 


When there are several experiments at the same condition, the scatter of the estimates is an indication
of accuracy. We can also evaluate scatter about a smooth fairing of the estimates in a series of experiments
with gradually changing conditions. This approach assumes that the parameters change smoothly as a function
of experimental condition.

The scatter has a significant advantage over many of the theoretical measures of accuracy discussed above.
The scatter measures the actual performance that some of the theoretical measures are trying to predict.
Therefore the scatter includes several effects, such as random errors in measuring the experiment conditions,
that are ignored in the theoretical predictions. You can gain the most information, of course, by considering
both the observed scatter and the theoretical predictions.

An inherent weakness in the use of scatter as a gauge of accuracy is that several data points are required
to define it. Depending on the application, this objection can range from inconsequential to insurmountable.
A related problem is that the scatter does not show the accuracy of individual points, some of which may be
better than others. For instance, if only two conflicting data points are available, the scatter gives no hint
as to which is more reliable. Figure (11.3-1) shows estimates of the parameter C_{n_p} obtained from flight
data of a PA-30 aircraft. The scatter is large, showing estimates of both signs.

Figure (11.3-2) shows the same data segregated into rudder and aileron maneuvers. In this case, the
scatter makes it evident that the aileron maneuvers result in far more consistent estimates of C_{n_p} than do
the rudder maneuvers. Had there been only one or two aileron and one or two rudder maneuvers available, there
would have been no way to deduce from the scatter that the aileron maneuvers were superior for estimating this
parameter.

The scatter shares a weakness with most of the theoretical accuracy measures in that it does not account
for consistent errors (i.e., biases). Many occurrences can result in small scatter about an incorrect value.
The scatter, therefore, should be regarded as a lower bound on the error. The estimates can be worse than is
indicated by the scatter, but are seldom better.

Maine and Iliff (1981b) discuss well-documented situations in which the scatter is significantly larger
than the Cramer-Rao bounds. In all such cases, we regard the scatter as a more realistic measure of the
magnitude of the errors. The Cramer-Rao bound is still a reasonable means of determining which individual
experiments are most accurate, but may not give a reasonable magnitude of the error.

In spite of its problems, the data scatter is an easily used tool for evaluating accuracy, and it should
always be examined when sufficient data points are available to define it.
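
A comparison of observed scatter with the predicted accuracy takes only a few lines. The numbers below are hypothetical and chosen purely for illustration (they are not taken from the PA-30 data or from Maine and Iliff); the sketch computes the sample standard deviation of repeated estimates, compares it with a typical Cramer-Rao bound, and ranks the experiments by their bounds.

```python
import statistics

# Hypothetical estimates of one parameter from six repeated maneuvers,
# with a hypothetical Cramer-Rao bound reported for each estimate.
estimates = [-0.052, -0.061, -0.047, -0.058, -0.066, -0.049]
cramer_rao_bounds = [0.0021, 0.0018, 0.0025, 0.0019, 0.0022, 0.0020]

scatter = statistics.stdev(estimates)              # sample standard deviation
typical_bound = statistics.mean(cramer_rao_bounds)
ratio = scatter / typical_bound

# Rank experiments by predicted accuracy (smallest bound first).
ranking = sorted(range(len(estimates)), key=lambda i: cramer_rao_bounds[i])

print(f"scatter = {scatter:.4f}, typical bound = {typical_bound:.4f}")
print(f"scatter / bound = {ratio:.2f}")
print("experiments ranked by Cramer-Rao bound:", ranking)
```

A ratio well above 1, as in this made-up case, suggests treating the scatter as the more realistic error magnitude while still using the bounds to compare individual experiments.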

11.3.3  Engineering  Judgment 

Engineering judgment is the oldest measure of estimate reliability. Even with the theoretical accuracy
measures now available, the need for judgment remains; the theoretical measures are merely tools which supply
more information on which to base the judgment. By definition, the process of applying engineering judgment
cannot be described precisely and quantitatively, or there would be no judgment involved. Algorithms can be
devised to search for specific problems, but the engineer still needs to make a final unautomated judgment.
Therefore, this section will simply list some of the factors most often considered in making a judgment.

One of the most basic factors in judging the accuracy of the estimates is the anticipated accuracy. The
engineer usually has a priori knowledge of how accurately one can reasonably expect to be able to estimate
the parameters. This knowledge can be based on previous experience, awareness of the relative importance and
linear dependence of the parameters, and the quality of experimental data obtained.

Another basic criterion is the reasonability of the estimated parameter values. Before analysis is begun,
we usually know the approximate range of values of the parameters. Drastic deviations from this range are
reason to suspect the estimates unless we discover the reason for the poor prediction or we independently
verify the suspect value.

We have previously mentioned the role of engineering judgment in evaluating model adequacy. The engineer
must look for violations of specific assumptions made in deriving the model, and for unexplained problems that
may indicate modeling errors. Both the estimator and the theoretical measures of accuracy can be invalidated
by modeling errors. The magnitude of the modeling-error effects must be judged.

The engineer judges the quality of the fit of the measured and estimated time histories. The
characteristics of this fit can give indications of many problems. Many modeling error problems first become
apparent as poor time-history fits. Failed sensors and data processing errors or omissions are among the other
classes of problems which can be deduced from the fits.

Finally, engineering judgment is used to assemble and weigh all of the available information about the
estimates. You must combine the judgmental factors with information from the theoretical tools in order to
give a final best estimate of the parameters and of their accuracies.


11.4  MODEL  STRUCTURE  DETERMINATION 

In the previous sections, we have largely assumed that the model form is correct. This is never
strictly true in practice. Therefore, we must always consider the possible effects of modeling error as a
special issue. The tools discussed in Section 11.3 can help in the evaluation of these effects.

In this section, we specifically examine the question of determining the best model structure for
parameter estimation. One approach to minimizing the effects of model structure errors is to use a model
structure which is close to that of the true system. There are, however, definite limits to this principle.
The limitations arise both in how accurate you can make the model and in how accurate you should make it.

In the field of simulation, it is almost axiomatic that the simulation fidelity improves as more detail
is added to the model. Practical considerations of cost and the degree of required fidelity dictate the level
of detail included in the model. Simulation and system identification are closely related fields, and we
might expect that such a basic principle would be common to both. Contrary to this expectation, system
identification sometimes obtains better results from a simple than from a detailed model. The use of too
detailed a model is probably one of the most common sources of difficulty in the practical application of
system identification.

The problems that arise from too detailed a model are best illustrated by a simple example. Presume that
Figure (11.4-1) shows experimental data from a system with a scalar input U and a scalar output Z. The line
in the figure is the best linear fit to the data. This line appears to be a reasonable representation of the
system.

To investigate possible nonlinear effects, consider the case of polynomial models. It is obvious that the
error between the model output and the experimental data will become smaller as the order of the model
increases. High-order polynomials include lower-order polynomials as specific cases (we have no requirement
that the high-order coefficient be nonzero), so the best second-order fit is at least as good as the best
linear fit, and so forth. When the order of the polynomial becomes one less than the number of data points,
the model will exactly match the experimental data (unless input values were repeated).

Figure (11.4-2) shows such a perfect match of the data from Figure (11.4-1). Although the data points are
matched perfectly, the curve oscillates wildly. The simple linear fit of Figure (11.4-1) is probably a much
better representation of the system, even though the model of Figure (11.4-2) is more detailed. We could say
that the model of Figure (11.4-2) is fitting the noise instead of the true response.
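
The effect is easy to reproduce on synthetic data. The sketch below does not use the data of Figure (11.4-1); it assumes NumPy, a hypothetical true system z = 2u, eight equally spaced inputs, and a fixed alternating error of 0.3 as a deterministic stand-in for measurement noise.

```python
import numpy as np
from numpy.polynomial import Polynomial

u = np.linspace(-1.0, 1.0, 8)                     # eight input values
noise = 0.3 * (-1.0) ** np.arange(8)              # fixed stand-in for measurement noise
z = 2.0 * u + noise                               # assumed true system: z = 2u

linear = Polynomial.fit(u, z, deg=1)
exact = Polynomial.fit(u, z, deg=7)               # order = number of points - 1

# Error between model output and the measured data points.
fit_error_linear = np.abs(linear(u) - z).max()
fit_error_exact = np.abs(exact(u) - z).max()

# Error between model output and the true response over the whole input range.
u_dense = np.linspace(-1.0, 1.0, 701)
truth = 2.0 * u_dense
model_error_linear = np.abs(linear(u_dense) - truth).max()
model_error_exact = np.abs(exact(u_dense) - truth).max()

print(f"fit error   linear: {fit_error_linear:.3f}   7th order: {fit_error_exact:.2e}")
print(f"true error  linear: {model_error_linear:.3f}   7th order: {model_error_exact:.3f}")
```

The seventh-order model reproduces the data points to round-off, yet it represents the assumed true response less accurately than the straight line does.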

Essentially, as the model complexity increases, and more unknown parameters are estimated, the problem
approaches the black-box system-identification problem where there are no assumptions about the model form. We
have previously shown that the pure black-box problem is insoluble. One can deduce only a finite amount of
information about the system from a finite amount of experimental data. The engineer provides, in the form of
an assumed model structure, the rest of the information required to solve the system-identification problem.

As the assumed model structure becomes more general, it provides less information, and thus more of the
information must be deduced from the experimental data. Eventually, one reaches a point where the information
available is insufficient; the estimation algorithms then perform poorly, giving ridiculous results.

The Cramer-Rao bound gives a statistical basis for estimating whether the experimental data contain
sufficient information to reliably estimate the parameters in a model. This and related statistics can be used
to determine the number and selection of terms to include in the model (Klein and Batterson, 1983; Gupta, Hall,
and Trankle, 1978; and Trankle, Vincent, and Franklin, 1982). The basic principle is to include in the model
only those terms that can be accurately estimated from the available experimental data. This process, known as
model structure determination, is described in further detail in the cited references. We will restrict our
discussion to the general nature and applicability of model structure determination.
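
As a rough illustration of the basic principle, the sketch below screens polynomial terms by comparing each
least-squares estimate with its estimated standard deviation. It is not the algorithm of any of the cited
references. The function name `screen_terms`, the threshold of 3 standard deviations, and the simulated data
are assumptions. For a model that is linear in the parameters with Gaussian measurement noise of variance
sigma^2, the Cramer-Rao bound is sigma^2 (X*X)^{-1}; here sigma^2 is itself estimated from the residuals.

```python
import numpy as np

def screen_terms(u, z, max_order=3, threshold=3.0):
    """One-pass screen: keep the powers of u whose estimate exceeds
    `threshold` estimated standard deviations (hypothetical helper)."""
    orders = list(range(max_order + 1))
    X = np.column_stack([u ** k for k in orders])      # candidate terms
    theta, *_ = np.linalg.lstsq(X, z, rcond=None)       # least-squares estimates
    resid = z - X @ theta
    sigma2 = resid @ resid / (len(z) - len(orders))     # noise variance estimate
    bound = sigma2 * np.linalg.inv(X.T @ X)             # Cramer-Rao bound estimate
    ratio = np.abs(theta) / np.sqrt(np.diag(bound))
    kept = [k for k, r in zip(orders, ratio) if r > threshold]
    return kept, dict(zip(orders, ratio))

rng = np.random.default_rng(0)
u = np.linspace(-1.0, 1.0, 50)
z = 2.0 * u + rng.normal(0.0, 0.1, u.size)   # assumed system: Z = 2U plus noise

kept, ratios = screen_terms(u, z)
print("terms kept (powers of U):", kept)
print({k: round(float(r), 1) for k, r in ratios.items()})
```

The first-order term stands far above its bound, while the other candidate terms usually do not. Practical
procedures iterate over candidate models rather than screening once.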


Automatic model structure determination is often viewed as a panacea that eliminates the necessity for
model selection to be based on engineering judgment and knowledge of the phenomenology of the system. Since
we have repeatedly emphasized that pure black-box system identification is impossible, such claims for
automatic model determination must be viewed with suspicion.

There is a basic fallacy in the argument that automatic model structure determination can replace
engineering judgment in selecting a model. The model structure determination algorithms are not creative; they
can only test candidate models suggested by the engineer. In fact, the model structure determination algorithms
are a type of parameter estimation in disguise, in which the parameter is an index indicating which model is to
be used. In a way, model structure determination is easier than most parameter estimation. At each stage,
there are only two possible values for a term, zero or nonzero; whereas most parameter estimation demands that
a specific value be picked from the entire real line. This task does not approach the scope of the black-box
system-identification problem in which the number of possible models is a high order of infinity.

Engineering judgment is still needed, therefore, to select the types of candidate models to be tested. If
the candidate models are not appropriate, the results will be questionable. The very best that could be
expected from an automatic algorithm in this circumstance would be rejection of all of the candidates (and not
all automatic tests have even that much capability). No automatic algorithm can suggest creative improvements
that it has not been specifically programmed for.

Consider a system with an actual output of Z = sin(U). Assume that a polynomial model has been selected
by the engineer, and automatic structure determination has been used to determine what order polynomial to use.
The task is hopeless in this form. The data can be fit arbitrarily well with a polynomial of a high enough
order, but the polynomial form does not describe the essence of the system. In particular, the finite
polynomial will not be valid for extrapolating system performance outside of the range of the experimental data.

In the above system, consider three ranges of U-values: |U| < 0.1, |U| < 1.0, and |U| < 10.0. In the
range |U| < 0.1, the linear polynomial Z = U is a close approximation, as shown in Figure (11.4-3). The
extrapolation of this approximation to the range |U| < 1.0 introduces noticeable errors, as shown in
Figure (11.4-4). Over this range, the approximation Z = U - U^3/6 is reasonable. If we expand our view to
the range |U| < 10.0, as in Figure (11.4-5), then neither the linear nor the third-order polynomial is at all
representative of the sine function. It would require at least a seventh-order polynomial to match even the
gross characteristics of the sine function over this range; a good match would require a still higher order.
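
The size of these approximation errors is easy to tabulate. The sketch below evaluates the largest error of
Z = U and of Z = U - U^3/6 against sin(U) over each of the three ranges; the helper name `max_error` and the
grid density are assumptions for illustration.

```python
import math

def max_error(approx, limit, samples=2001):
    """Largest |approx(u) - sin(u)| on an even grid over |u| <= limit."""
    grid = (-limit + 2.0 * limit * i / (samples - 1) for i in range(samples))
    return max(abs(approx(u) - math.sin(u)) for u in grid)

def linear(u):
    return u

def cubic(u):
    return u - u ** 3 / 6.0

for limit in (0.1, 1.0, 10.0):
    print(f"|U| < {limit:4}: linear {max_error(linear, limit):.3g}, "
          f"cubic {max_error(cubic, limit):.3g}")
```

Over |U| < 0.1 both errors are negligible; over |U| < 1.0 the linear error is about 0.16 while the cubic
error stays below 0.01; over |U| < 10.0 both polynomials are in error by more than the entire range of the
sine function.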

Another problem with automatic model structure determination is that it gives only a statistical estimate.
Like all estimates, it is imperfect. If no better information is available, it is appropriate to use
automatic model structure determination as the best guess. If, however, facts about the model structure are
deducible from the physics of the system, it is silly to throw away known facts and use imperfect estimates.
(This is one of the most basic principles in the entire field of system identification, not just in model
structure determination: If a fact is known, use it and save the estimation theory for cases in which it is
needed.)

The most basic problem with automatic model structure determination lies in the statement of the problem.
The very term "model structure determination" is misleading, because there is seldom a correct model to
determine. Even when there is a correct model, it may be far too complicated for practical purposes. The real
model structure determination problem is not to determine some nonexistent "correct" model structure, but to
determine an adequate model structure. We discussed the idea of adequate models in Section 1.4; the idea of
an adequate model structure is an intimate part of the idea of an adequate model.

This basic issue is addressed briefly, if at all, in most of the literature on model structure
determination. Many papers generate simulated data with a specified model, and then demonstrate that a proposed
model structure determination algorithm can determine the correct model. This approach has little to do with
the real issue in model structure determination.

The previous paragraphs have emphasized the numerous problems of automatic model structure determination.
That these problems exist does not mean that automatic model structure determination is worthless, only that
the mindless application of it is dangerous. Automatic model structure determination can be a valuable tool
when used with an appreciation of its limitations. Most good model structure determination programs allow the
engineer to override the statistical decision and force specific terms to be included or omitted. This
approach makes good use of both the theory and the judgment, so that the theory is used as a tool to aid the
judgment and to warn against some types of poor judgment, but the end responsibility lies with the engineer.


11.5  EXPERIMENT  DESIGN 

The previous discussion has, for the most part, assumed that a specific set of experimental data has
already been gathered. In some cases, this is a valid assumption. In other cases, the opportunity is
available to specify the experiments to be performed and the measurements to be taken. This section gives a
brief overview of the subject of designing experiments for parameter identification. We leave detailed
discussion to works cited in the references.

Methods for experiment design fall into two major categories. The first category is that of methods based
on numerical optimization. Such methods choose an input, subject to appropriate constraints, which minimizes
the Cramer-Rao bound or some related error estimate. Goodwin (1982) and Plaetschke and Schulz (1979) give
theoretical and practical details of some optimization approaches to input design.

Experiment design is often strongly constrained by practical considerations; in the extreme case, the
constraints completely specify the input, leaving no latitude for design. In a design based on numerical
optimization, the constraints must be expressed mathematically. The derivation of such expressions is sometimes
straightforward, as when a control device is limited by a physical stop at a specific position. In other
cases, the constraints involve issues such as safety that are difficult to quantify as precise limits.
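
As a toy illustration of the optimization-based category, the sketch below compares two candidate input
sets for the static model Z = theta0 + theta1 U with Gaussian noise, under an assumed amplitude constraint
|U| <= 1. It only evaluates two hand-picked candidates rather than running an optimizer; the model, the
constraint, the trace criterion, and the name `bound_trace` are all assumptions for illustration.

```python
import numpy as np

def bound_trace(u, sigma=1.0):
    """Trace of the Cramer-Rao bound sigma^2 (X*X)^{-1} for Z = theta0 + theta1*U."""
    X = np.column_stack([np.ones_like(u), u])
    return float(np.trace(sigma ** 2 * np.linalg.inv(X.T @ X)))

n = 10
even = np.linspace(-1.0, 1.0, n)                        # inputs spread evenly
ends = np.array([-1.0] * (n // 2) + [1.0] * (n // 2))   # inputs at the limits

print(f"evenly spread inputs: {bound_trace(even):.4f}")
print(f"inputs at the limits: {bound_trace(ends):.4f}")
```

For this model the inputs placed at the constraint limits give the smaller bound, which shows how strongly
the assumed constraint shapes the resulting design.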

CHAPTER 12


12.0  SUMMARY 

In this document, we have presented the theoretical background of statistical estimators for dynamic
systems, with particular emphasis on maximum-likelihood estimators. An understanding of this theoretical
background is crucial to the practical application of the estimators; the analyst needs to know the capabilities
and limitations of the estimators. There are several examples of artificially complicated problems that
succumb to simple approaches, and seemingly trivial questions that have no answers.

A thorough understanding of the system being analyzed is necessary to complement this theoretical
background. No amount of theoretical sophistication can compensate for the lack of such understanding. The
entire theory rests on the basis of the assumptions made about the system characteristics. The theory can give
only limited help in validating or refuting such assumptions.

Errors and unexpected difficulties are inevitable in any substantial parameter estimation project. The
eventual success of the project hinges on the analyst's ability to recognize unreasonable results and diagnose
their causes. This ability, in turn, requires an understanding of both estimation theory and the system being
analyzed. Problems can range from obvious instrumentation failures to subtle modeling inconsistencies and
identifiability problems.

Probably the most difficult part of parameter estimation is to straddle the fine line between models too
simple to adequately represent the system and models too complicated to be identifiable. There is no
conservative position on this issue; excesses in either direction can be fatal. The solution is typically
iterative, using diagnostic skills to detect problems and make improvements until an adequate result is
obtained. The problem is exacerbated by there being no correct answer.

Neither is there a single correct method to solve parameter estimation problems. Although we have
castigated some practices as demonstrably poor, we make no attempt to establish as dogma any particular method.
The material of this document is intended more as a set of tools for parameter estimation problems. The
selection of the best tools for a particular task is influenced by factors other than the purely theoretical.
Better results often come from a crude, but adequate, method that the analyst thoroughly understands than from
a sophisticated, but unfamiliar, method. We recommend the attitude expressed by Gauss (1809, p. 108):

It is always profitable to approach the more difficult problems in several
ways, and not to despise the good although preferring the better.

APPENDIX A


A.0  MATRIX RESULTS


This appendix presents several matrix results used in the body of the book. The derivations are mostly
exercises in simple matrix algebra. Various of these results are given in numerous other documents; Goodwin
and Payne (1977, appendix E) present most of them.


A.1  MATRIX INVERSION LEMMAS

Consider a square, nonsingular matrix A, partitioned as

    A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}                              (A.1-1)

where A_{11} and A_{22} are square. Define the inverse of A to be \Gamma, similarly partitioned as

    \Gamma = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix}     (A.1-2)

where \Gamma_{11} is the same size as A_{11}. We want to express the partitions \Gamma_{ij} in terms of the
A_{ij}. To derive such expressions, we need to assume that either A_{11} or A_{22} is invertible; if both are
singular, there is no useful form. Consider first the case where A_{11} is invertible.

Lemma A.1-1  Given A and \Gamma partitioned as in Equations (A.1-1) and (A.1-2),
assume that A and A_{11} are invertible. Then (A_{22} - A_{21} A_{11}^{-1} A_{12}) is invertible
and the partitions of \Gamma are given by

    \Gamma_{11} = A_{11}^{-1} + A_{11}^{-1} A_{12} (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1} A_{21} A_{11}^{-1}     (A.1-3)
    \Gamma_{12} = -A_{11}^{-1} A_{12} (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1}                                       (A.1-4)
    \Gamma_{21} = -(A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1} A_{21} A_{11}^{-1}                                       (A.1-5)
    \Gamma_{22} = (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1}                                                           (A.1-6)

Proof  The condition A\Gamma = I gives the four equations

    A_{11} \Gamma_{12} + A_{12} \Gamma_{22} = 0          (A.1-7)
    A_{21} \Gamma_{11} + A_{22} \Gamma_{21} = 0          (A.1-8)
    A_{11} \Gamma_{11} + A_{12} \Gamma_{21} = I          (A.1-9)
    A_{21} \Gamma_{12} + A_{22} \Gamma_{22} = I          (A.1-10)

and the condition \Gamma A = I gives the four equations

    \Gamma_{11} A_{12} + \Gamma_{12} A_{22} = 0          (A.1-11)
    \Gamma_{21} A_{11} + \Gamma_{22} A_{21} = 0          (A.1-12)
    \Gamma_{11} A_{11} + \Gamma_{12} A_{21} = I          (A.1-13)
    \Gamma_{21} A_{12} + \Gamma_{22} A_{22} = I          (A.1-14)

Equations (A.1-7) and (A.1-12), respectively, give

    \Gamma_{12} = -A_{11}^{-1} A_{12} \Gamma_{22}        (A.1-15)
    \Gamma_{21} = -\Gamma_{22} A_{21} A_{11}^{-1}        (A.1-16)

Substitute Equation (A.1-15) into Equation (A.1-10) and substitute
Equation (A.1-16) into Equation (A.1-14) to get

    (A_{22} - A_{21} A_{11}^{-1} A_{12}) \Gamma_{22} = I          (A.1-17)
    \Gamma_{22} (A_{22} - A_{21} A_{11}^{-1} A_{12}) = I          (A.1-18)

By the assumption of invertibility of A, the \Gamma_{ij} exist and satisfy
Equations (A.1-7) to (A.1-14). The assumption of invertibility of A_{11} then
assures, through the above substitutions, that \Gamma_{22} satisfies
Equations (A.1-17) and (A.1-18). Therefore (A_{22} - A_{21} A_{11}^{-1} A_{12}) is invertible and
\Gamma_{22} is given by Equation (A.1-6).

Substituting Equation (A.1-6) into Equations (A.1-15) and (A.1-16) gives
Equations (A.1-4) and (A.1-5). Finally, substituting Equation (A.1-5) into
Equation (A.1-9) and solving for \Gamma_{11} gives Equation (A.1-3), completing
the proof.

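The case where A_{22} is nonsingular is simply a permutation of the same lemma.

Lemma A.1-2  Given A and \Gamma partitioned as in Equations (A.1-1) and (A.1-2),
assume that A and A_{22} are invertible. Then (A_{11} - A_{12} A_{22}^{-1} A_{21}) is invertible
and the partitions of \Gamma are given by

    \Gamma_{11} = (A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1}                                                           (A.1-19)
    \Gamma_{12} = -(A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} A_{12} A_{22}^{-1}                                       (A.1-20)
    \Gamma_{21} = -A_{22}^{-1} A_{21} (A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1}                                       (A.1-21)
    \Gamma_{22} = A_{22}^{-1} + A_{22}^{-1} A_{21} (A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} A_{12} A_{22}^{-1}     (A.1-22)

Proof  Define a reordered matrix

    A' = \begin{bmatrix} A_{22} & A_{21} \\ A_{12} & A_{11} \end{bmatrix}

The inverse of A' is given by the corresponding reordering of \Gamma.

    \Gamma' = \begin{bmatrix} \Gamma_{22} & \Gamma_{21} \\ \Gamma_{12} & \Gamma_{11} \end{bmatrix}

Then apply the previous lemma to A' and \Gamma'.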
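
The two partitioned forms can be checked numerically against a direct inverse. The sketch below assumes the
standard statements of Lemmas (A.1-1) and (A.1-2), with the Schur complements (A22 - A21 A11^{-1} A12) and
(A11 - A12 A22^{-1} A21); the test matrix, its block sizes, and the variable names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 3, 2
# Strictly diagonally dominant test matrix, so A, A11, and A22 are invertible.
A = rng.random((n1 + n2, n1 + n2)) + (n1 + n2) * np.eye(n1 + n2)
A11, A12 = A[:n1, :n1], A[:n1, n1:]
A21, A22 = A[n1:, :n1], A[n1:, n1:]
inv = np.linalg.inv

# Lemma A.1-1: partitions of the inverse written with the inverse of A11.
S = inv(A22 - A21 @ inv(A11) @ A12)
G1 = np.block([
    [inv(A11) + inv(A11) @ A12 @ S @ A21 @ inv(A11), -inv(A11) @ A12 @ S],
    [-S @ A21 @ inv(A11), S],
])

# Lemma A.1-2: partitions of the inverse written with the inverse of A22.
T = inv(A11 - A12 @ inv(A22) @ A21)
G2 = np.block([
    [T, -T @ A12 @ inv(A22)],
    [-inv(A22) @ A21 @ T, inv(A22) + inv(A22) @ A21 @ T @ A12 @ inv(A22)],
])

print(np.allclose(G1, inv(A)), np.allclose(G2, inv(A)))
```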

When both A_{11} and A_{22} are invertible, we can combine the above lemmas to obtain two other useful results.

Lemma A.1-3  Assume that two matrices A and C are invertible. Further
assume that one of the expressions (A - B C^{-1} D) or (C - D A^{-1} B) is invertible.
Then the other expression is also invertible and

    (A - B C^{-1} D)^{-1} = A^{-1} + A^{-1} B (C - D A^{-1} B)^{-1} D A^{-1}          (A.1-23)

Proof  Define A_{11} = A, A_{12} = B, A_{21} = D, and A_{22} = C. In order to apply
Lemmas (A.1-1) and (A.1-2), we first need to show that the partitioned matrix defined by
Equation (A.1-1) is invertible.

If (C - D A^{-1} B) is invertible, then the \Gamma_{ij} defined by Equations (A.1-3)
to (A.1-6) satisfy Equations (A.1-7) to (A.1-14). Therefore the partitioned matrix is invertible.
Lemma (A.1-2) then gives the invertibility of (A - B C^{-1} D), which is one of
the desired results.

Conversely, if we assume that (A - B C^{-1} D) is invertible, then the \Gamma_{ij}
defined by Equations (A.1-19) to (A.1-22) satisfy Equations (A.1-7) to (A.1-14).
Therefore the partitioned matrix is invertible and Lemma (A.1-1) gives the invertibility of the
expression (C - D A^{-1} B).

Thus the invertibility of either expression implies invertibility of the other
and of the partitioned matrix. We can now apply both Lemmas (A.1-1) and (A.1-2). Equating the
expressions for \Gamma_{11} given by Equations (A.1-3) and (A.1-19), and putting the
result in terms of A, B, C, and D, gives Equation (A.1-23), completing the
proof.

Lemma A.1-4  Given A, B, C, and D as in Lemma (A.1-3), with the same
invertibility assumptions, then

    A^{-1} B (C - D A^{-1} B)^{-1} = (A - B C^{-1} D)^{-1} B C^{-1}          (A.1-24)

Proof  The proof is identical to that of Lemma (A.1-3), except that we equate
the expressions for \Gamma_{12} given by Equations (A.1-4) and (A.1-20), giving
Equation (A.1-24) as a result.
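
A numerical spot check of both identities is sketched below, taking Equation (A.1-23) in the form
(A - B C^{-1} D)^{-1} = A^{-1} + A^{-1} B (C - D A^{-1} B)^{-1} D A^{-1} and Equation (A.1-24) as
A^{-1} B (C - D A^{-1} B)^{-1} = (A - B C^{-1} D)^{-1} B C^{-1}. The matrix sizes and the random test values
are assumptions; the diagonal offsets only serve to keep every inverse well defined.

```python
import numpy as np

rng = np.random.default_rng(2)
inv = np.linalg.inv
A = rng.random((3, 3)) + 5.0 * np.eye(3)
C = rng.random((2, 2)) + 5.0 * np.eye(2)
B = rng.random((3, 2))
D = rng.random((2, 3))

# Equation (A.1-23)
lhs_23 = inv(A - B @ inv(C) @ D)
rhs_23 = inv(A) + inv(A) @ B @ inv(C - D @ inv(A) @ B) @ D @ inv(A)

# Equation (A.1-24)
lhs_24 = inv(A) @ B @ inv(C - D @ inv(A) @ B)
rhs_24 = inv(A - B @ inv(C) @ D) @ B @ inv(C)

print(np.allclose(lhs_23, rhs_23), np.allclose(lhs_24, rhs_24))
```

Equation (A.1-23) is useful in practice when C is much smaller than A and A^{-1} is already known, since
only the small matrix (C - D A^{-1} B) must then be inverted.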

A.2  MATRIX DIFFERENTIATION


For several of the following results, it is convenient to define the derivative of a scalar with respect
to a matrix. If f is a scalar function of the matrix A, we define df/dA to be a matrix with elements
equal to the derivatives of f with respect to corresponding elements of A.

    \left( \frac{df}{dA} \right)^{(i,j)} = \frac{\partial f}{\partial A^{(i,j)}}          (A.2-1)

Two simple relations involving the trace function are useful in manipulating the matrix and vector
quantities we work with.


Result A.2-1  If x and y are two vectors of the same length, then

    x^* y = \mathrm{tr}(y x^*)          (A.2-2)

Proof  Both sides expand to

    \sum_i x^{(i)} y^{(i)}

Result A.2-2  If A and B are two matrices of the same size, then

    \sum_{i,j} A^{(i,j)} B^{(i,j)} = \mathrm{tr}(A B^*)          (A.2-3)

Proof  Expand the right side, element by element.
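
Both trace relations are simple to confirm numerically. In the sketch below the vectors and matrices are
real, so the transpose plays the role of the * operation; the sizes and random values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x, y = rng.random(4), rng.random(4)
A, B = rng.random((3, 5)), rng.random((3, 5))

inner_xy = x @ y                        # x* y
trace_xy = np.trace(np.outer(y, x))     # tr(y x*)

sum_ab = np.sum(A * B)                  # sum over i,j of A(i,j) B(i,j)
trace_ab = np.trace(A @ B.T)            # tr(A B*)

print(np.isclose(inner_xy, trace_xy), np.isclose(sum_ab, trace_ab))
```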


Both of these results are special cases of the same relationship between inner products and outer products.
The following result is a particular application of Result (A.2-2).


Result A.2-3  If f(A) is a scalar function of the matrix A, and A is a
function of the scalar x, then

    \frac{df}{dx} = \mathrm{tr}\left[ \frac{df}{dA} \left( \frac{dA}{dx} \right)^* \right]          (A.2-4)

Proof  Use the chain rule with the individual elements of A to write

    \frac{df}{dx} = \sum_{i,j} \frac{\partial f}{\partial A^{(i,j)}} \frac{dA^{(i,j)}}{dx}          (A.2-5)

Equation (A.2-4) then follows from Result (A.2-2) and the definition given
by Equation (A.2-1).

Result A.2-4  If the matrix A is a function of x, then

    \frac{d}{dx} (A^{-1}) = -A^{-1} \frac{dA}{dx} A^{-1}          (A.2-6)

wherever A is invertible.

Proof  By the definition of the inverse

    A A^{-1} = I          (A.2-7)

Take the derivative, using the chain rule.

    \frac{d}{dx} (A A^{-1}) = \frac{dI}{dx} = 0                      (A.2-8)
    \frac{dA}{dx} A^{-1} + A \frac{d}{dx} (A^{-1}) = 0               (A.2-9)

Solving for d/dx(A^{-1}) gives Equation (A.2-6), as desired.

Result A.2-5  If A is invertible, and x and y are vectors, then

    \frac{\partial}{\partial A} (x^* A^{-1} y) = -(A^{-1} y x^* A^{-1})^*          (A.2-10)

Proof  Use Result (A.2-4) to get

    \frac{\partial}{\partial A^{(i,j)}} (x^* A^{-1} y) = -x^* A^{-1} \frac{\partial A}{\partial A^{(i,j)}} A^{-1} y          (A.2-11)

Now

    \frac{\partial A}{\partial A^{(i,j)}} = e_i e_j^*          (A.2-12)

where e_i is a vector with zeros in all but the ith element, which is 1.
Therefore

    \frac{\partial}{\partial A^{(i,j)}} (x^* A^{-1} y) = -x^* A^{-1} e_i e_j^* A^{-1} y = -e_j^* A^{-1} y x^* A^{-1} e_i

which is the (i,j) element of -(A^{-1} y x^* A^{-1})^*. The definition of the matrix
derivative then gives Equation (A.2-10) as desired.
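
Results (A.2-4) and (A.2-5) can be checked against central finite differences. This is a sketch under stated
assumptions: real matrices (so * is the transpose), the particular dependence A(x) = A0 + x A1 evaluated at
x = 0, and an arbitrary step size h.

```python
import numpy as np

rng = np.random.default_rng(4)
inv = np.linalg.inv
n, h = 3, 1e-6
A0 = rng.random((n, n)) + n * np.eye(n)   # diagonally dominant, hence invertible
A1 = rng.random((n, n))                   # dA/dx for the assumed A(x) = A0 + x*A1
x_vec, y_vec = rng.random(n), rng.random(n)

# Result A.2-4: d/dx (A^{-1}) = -A^{-1} (dA/dx) A^{-1}, checked at x = 0.
numeric_dinv = (inv(A0 + h * A1) - inv(A0 - h * A1)) / (2 * h)
analytic_dinv = -inv(A0) @ A1 @ inv(A0)

# Result A.2-5: derivative of f(A) = x* A^{-1} y with respect to each element of A.
def f(A):
    return x_vec @ inv(A) @ y_vec

numeric_grad = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = h                        # perturb only element (i, j)
        numeric_grad[i, j] = (f(A0 + E) - f(A0 - E)) / (2 * h)
analytic_grad = -(inv(A0) @ np.outer(y_vec, x_vec) @ inv(A0)).T

print(np.allclose(numeric_dinv, analytic_dinv, atol=1e-6),
      np.allclose(numeric_grad, analytic_grad, atol=1e-6))
```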

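Result A.2-6  If A is invertible, then

    \frac{\partial}{\partial A} \ln|A| = (A^*)^{-1}          (A.2-14)

Proof  Expanding the determinant by cofactors of the ith row gives

    \ln|A| = \ln \sum_k A^{(i,k)} (\mathrm{adj}\, A)^{(k,i)}          (A.2-15)

Taking the derivative with respect to A^{(i,j)} gives

    \frac{\partial}{\partial A^{(i,j)}} \ln|A| = \frac{(\mathrm{adj}\, A)^{(j,i)}}{\sum_k A^{(i,k)} (\mathrm{adj}\, A)^{(k,i)}}

because (adj A)^{(k,i)} does not depend on A^{(i,j)}. Using Equation (A.2-15) and
the expression for a matrix inverse in terms of the matrix of cofactors, we
get

    \frac{\partial}{\partial A^{(i,j)}} \ln|A| = \frac{(\mathrm{adj}\, A)^{(j,i)}}{|A|} = (A^{-1})^{(j,i)}

Equation (A.2-14) then follows, as desired, from the definition of the
derivative with respect to a matrix.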
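
Result (A.2-6) can be checked the same way. The sketch below assumes a real test matrix with positive
determinant, so that ln|A| is defined, and uses central differences with an arbitrary step h; the helper name
`log_det` is an assumption.

```python
import numpy as np

rng = np.random.default_rng(5)
n, h = 3, 1e-6
A = rng.random((n, n)) + n * np.eye(n)    # diagonally dominant, so |A| > 0

def log_det(M):
    """Natural log of |det M|, computed stably with slogdet."""
    sign, logabs = np.linalg.slogdet(M)
    return logabs

numeric = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = h                        # perturb only element (i, j)
        numeric[i, j] = (log_det(A + E) - log_det(A - E)) / (2 * h)
analytic = np.linalg.inv(A).T              # (A*)^{-1} for a real matrix

print(np.allclose(numeric, analytic, atol=1e-6))
```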

AFFTC-TIH-79-.'       -                     USAFTPS Flight Test Handbook. Flying Qualities:           1979
                                            Theory (Vol.1) and Flight Test Techniques (Vol.2)

AFFTC-TIM-81-1        Rawlings, K., III     A Method of Estimating Upwash Angle at Noseboom-          1981
                                            Mounted Vanes

AFFTC-TIH-81-1        Plews, L. and         Aircraft Brake Systems Testing Handbook                   1981
                      Mandt, G.

AFFTC-TIH-81-5        DeAnda, A.G.          AFFTC Standard Airspeed Calibration Procedures            1981

AFFTC-TIH-81-6        Lush, K.              Fuel Subsystems Flight Test Handbook                      1981

AFEWC-DR 1-81         -                     Radar Cross Section Handbook                              1981

NATC-TM71-ISA226      Hewett, M.D.          On Improving the Flight Fidelity of Operational Flight/   1975
                      Galloway, R.T.        Weapon System Trainers

NATC-TM-TPS76-1       Bowes, W.C.           Inertially Derived Flying Qualities and Performance       1976
                      Miller, R.V.          Parameters

NASA Ref. Publ. 1008  Fisher, F.A.          Lightning Protection of Aircraft                          1977
                      Plumer, J.A.

NASA Ref. Publ. 1046  Gracey, W.            Measurement of Aircraft Speed and Altitude                1980

NASA Ref. Publ. 1075  Kalil, F.             Magnetic Tape Recording for the Eighties (Sponsored by:   1982
                                            Tape Head Interface Committee)

The following handbooks are written in French and are edited by the French Test Pilot School (EPNER Ecole du
Personnel Navigant d'Essais et de Réception ISTRES - FRANCE), to which requests should be addressed.

Number EPNER  Author            Title                                                    Price (1983)    Notes
Reference                                                                                French Francs

2             G.Leblanc         L'analyse dimensionnelle                                 20              Réédition 1977

7             EPNER             Manuel d'exploitation des enregistrements d'Essais       60              6ème Edition 1970
                                en vol

8             M.Durand          La mécanique du vol de l'hélicoptère                     155             1ère Edition 1981

12            C.Laburthe        Mécanique du vol de l'avion appliquée aux essais en      160             Réédition en cours
                                vol

15            A.Hisler          La prise en main d'un avion nouveau                      50              1ère Edition 1964

16            Candau            Programme d'essais pour l'évaluation d'un hélicoptère    20              2ème Edition 1970
                                et d'un pilote automatique d'hélicoptère

22            Cattaneo          Cours de métrologie                                      45              Réédition 1982

24            G.Fraysse         Pratique des essais en vol (en 3 Tomes)                  T 1 = 160       1ère Edition 1973
              F.Cousson                                                                  T 2 = 160
                                                                                         T 3 = 120

25            EPNER             Pratique des essais en vol hélicoptère (en 2 Tomes)      T 1 = 150       Edition 1981
                                                                                         T 2 = 150

26            J.C.Wanner        Bang sonique                                             60

31            Tarnowski         Inertie-verticale-sécurité                               50              1ère Edition 1981

32            B.Pennacchioni    Aéroélasticité - le flottement des avions                40              1ère Edition 1980

33            C.Lelaie          Les vrilles et leurs essais                              110             Edition 1981

37            S.Aller-          Electricité à bord des aéronefs                          100             Edition 1978

53            J.C.Wanner        Le moteur d'avion (en 2 Tomes)                                           Réédition 1982
                                T 1 Le réacteur                                          85
                                T 2 Le turbopropulseur                                   85

55            De Cennival       Installation des turbomoteurs sur hélicoptères           60              2ème Edition 1980

63            Gremont           Aperçu sur les pneumatiques et leurs propriétés          25              3ème Edition 1972

77            Gremont           L'atterrissage et le problème du freinage                40              2ème Edition 1978

82            Auffret           Manuel de médecine aéronautique                          55              Edition 1979

85            Monnier           Conditions de calcul des structures d'avions             25              1ère Edition 1964

88            Richard           Technologie hélicoptère                                  95              Réédition 1971

This document addresses the problem of estimating parameters of dynamic systems.

The aim is to present the theoretical basis of system identification and parameter
estimation in a manner that is complete and rigorous, yet understandable with minimal
prerequisites. The document concentrates on maximum likelihood and related estimators.

The approach used requires familiarity with calculus, linear algebra, and probability,
but does not require knowledge of stochastic processes or functional analysis. No
previous background in statistics is assumed.

The treatment emphasizes unification of the various areas in estimation theory and
practice. For example, the theory of estimation in dynamic systems is treated as a
direct outgrowth of the static system theory.


Topics covered include basic concepts and definitions; numerical optimization
methods; probability; statistical estimators; estimation in static systems; stochastic
processes; state estimation in dynamic systems; output error, filter error, and equation
error methods of parameter estimation in dynamic systems; and the accuracy of the
estimates.


